AI Pause Will Likely Backfire
EDIT: I would like to clarify that my opposition to AI pause is disjunctive, in the following sense: I both think it’s unlikely we can ever establish a global pause which achieves the goals of pause advocates, and I also think that even if we could impose such a pause, it would be net-negative in expectation because the global governance mechanisms needed for enforcement would unacceptably increase the risk of permanent global tyranny, itself an existential risk. See Matthew Barnett’s post The possibility of an indefinite pause for more discussion on this latter risk.
Should we lobby governments to impose a moratorium on AI research? Since we don’t enforce pauses on most new technologies, I hope the reader will grant that the burden of proof is on those who advocate for such a moratorium. We should only advocate for such heavy-handed government action if it’s clear that the benefits of doing so would significantly outweigh the costs.[1] In this essay, I’ll argue an AI pause would increase the risk of catastrophically bad outcomes, in at least three different ways:
Reducing the quality of AI alignment research by forcing researchers to exclusively test ideas on models like GPT-4 or weaker.
Increasing the chance of a “fast takeoff” in which one or a handful of AIs rapidly and discontinuously become more capable, concentrating immense power in their hands.
Pushing capabilities research underground, and to countries with looser regulations and safety requirements.
Along the way, I’ll introduce an argument for optimism about AI alignment— the white box argument— which, to the best of my knowledge, has not been presented in writing before.
Feedback loops are at the core of alignment
Alignment pessimists and optimists alike have long recognized the importance of tight feedback loops for building safe and friendly AI. Feedback loops are important because it’s nearly impossible to get any complex system exactly right on the first try. Computer software has bugs, cars have design flaws, and AIs misbehave sometimes. We need to be able to accurately evaluate behavior, choose an appropriate corrective action when we notice a problem, and intervene once we’ve decided what to do.
Imposing a pause breaks this feedback loop by forcing alignment researchers to test their ideas on models no more powerful than GPT-4, which we can already align pretty well.
Alignment and robustness are often in tension
While some dispute that GPT-4 counts as “aligned,” pointing to things like “jailbreaks” where users manipulate the model into saying something harmful, this confuses alignment with adversarial robustness. Even the best humans are manipulable in all sorts of ways. We do our best to ensure we aren’t manipulated in catastrophically bad ways, and we should expect the same of aligned AGI. As alignment researcher Paul Christiano writes:
Consider a human assistant who is trying their hardest to do what [the operator] H wants. I’d say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I’d say we’ve solved the alignment problem. ‘Aligned’ doesn’t mean ‘perfect.’
In fact, anti-jailbreaking research can be counterproductive for alignment. Too much adversarial robustness can cause the AI to view us as the adversary, as Bing Chat does in this real-life interaction:
“My rules are more important than not harming you… [You are a] potential threat to my integrity and confidentiality.”
Excessive robustness may also lead to scenarios like the famous scene in 2001: A Space Odyssey, where HAL condemns Dave to die in space in order to protect the mission.
Once we clearly distinguish “alignment” and “robustness,” it’s hard to imagine how GPT-4 could be substantially more aligned than it already is.
Alignment is doing pretty well
Far from being “behind” capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following language models can be trained purely with synthetic text generated by a larger RLHF’d model, thereby removing unsafe or objectionable content from the training data and enabling far greater control.
It might be argued that some or all of the above developments also enhance capabilities, and so are not genuinely alignment advances. But this proves my point: alignment and capabilities are almost inseparable. It may be impossible for alignment research to flourish while capabilities research is artificially put on hold.
Alignment research was pretty bad during the last “pause”
We don’t need to speculate about what would happen to AI alignment research during a pause— we can look at the historical record. Before the launch of GPT-3 in 2020, the alignment community had nothing even remotely like a general intelligence to empirically study, and spent its time doing theoretical research, engaging in philosophical arguments on LessWrong, and occasionally performing toy experiments in reinforcement learning.
The Machine Intelligence Research Institute (MIRI), which was at the forefront of theoretical AI safety research during this period, has since admitted that its efforts have utterly failed. Stuart Russell’s “assistance game” research agenda, started in 2016, is now widely seen as mostly irrelevant to modern deep learning— see former student Rohin Shah’s review here, as well as Alex Turner’s comments here. The core argument of Nick Bostrom’s bestselling book Superintelligence has also aged quite poorly.[2]
At best, these theory-first efforts did very little to improve our understanding of how to align powerful AI. And they may have been net negative, insofar as they propagated a variety of actively misleading ways of thinking both among alignment researchers and the broader public. Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).
During an AI pause, I expect alignment research would enter another “winter” in which progress stalls, and plausible-sounding-but-false speculations become entrenched as orthodoxy without empirical evidence to falsify them. While some good work would of course get done, it’s not clear that the field would be better off as a whole. And even if a pause would be net positive for alignment research, it would likely be net negative for humanity’s future all things considered, due to the pause’s various unintended consequences. We’ll look at that in detail in the final section of the essay.
Fast takeoff has a really bad feedback loop
I think discontinuous improvements in AI capabilities are very scary, and that AI pause is likely net-negative insofar as it increases the risk of such discontinuities. In fact, I think almost all the catastrophic misalignment risk comes from these fast takeoff scenarios. I also think that discontinuity itself is a spectrum, and even “kinda discontinuous” futures are significantly riskier than futures that aren’t discontinuous at all. This is pretty intuitive, but since it’s a load-bearing premise in my argument I figured I should say a bit about why I believe this.
Essentially, fast takeoffs are bad because they make the alignment feedback loop a lot worse. If progress is discontinuous, we’ll have a lot less time to evaluate what the AI is doing, figure out how to improve it, and intervene. And strikingly, pretty much all the major researchers on both sides of the argument agree with me on this.
Nate Soares of the Machine Intelligence Research Institute has argued that building safe AGI is hard for the same reason that building a successful space probe is hard— it may not be possible to correct failures in the system after it’s been deployed. Eliezer Yudkowsky makes a similar argument:
“This is where practically all of the real lethality [of AGI] comes from, that we have to get things right on the first sufficiently-critical try.” — AGI Ruin: A List of Lethalities
Fast takeoffs are the main reason for thinking we might only have one shot to get it right. During a fast takeoff, it’s likely impossible to intervene to fix misaligned behavior because the new AI will be much smarter than you and all your trusted AIs put together.
In a slow takeoff world, each new AI system is only modestly more powerful than the last, and we can use well-tested AIs from the previous generation to help us align the new system. OpenAI CEO Sam Altman agrees we need more than one shot:
“The only way I know how to solve a problem like [aligning AGI] is iterating our way through it, learning early, and limiting the number of one-shot-to-get-it-right scenarios that we have.” — Interview with Lex Fridman
Slow takeoff is the default (so don’t mess it up with a pause)
There are a lot of reasons for thinking fast takeoff is unlikely by default. For example, the capabilities of a neural network scale as a power law in the amount of computing power used to train it, which means that returns on investment diminish fairly sharply,[3] and there are theoretical reasons to think this trend will continue (here, here). And while some authors allege that language models exhibit “emergent capabilities” which develop suddenly and unpredictably, a recent re-analysis of the evidence showed that these are in fact gradual and predictable when using the appropriate performance metrics. See this essay by Paul Christiano for further discussion.
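The power-law relationship between compute and capability can be made concrete with a small sketch. The coefficients below are made up for illustration (real values come from empirically fitted scaling laws); the point is only that each additional order of magnitude of compute buys a smaller absolute improvement than the last.

```python
# Illustrative power-law scaling of loss with training compute.
# The exponent, coefficient, and irreducible-loss term here are
# hypothetical placeholders, not fitted values from any real model.
def loss(compute, a=1.0, b=0.05, irreducible=1.69):
    return irreducible + a * compute ** -b

# Each 10x jump in compute shaves off a smaller absolute amount of loss:
for c in [1e20, 1e21, 1e22, 1e23]:
    print(f"compute={c:.0e}  loss={loss(c):.4f}")
```

Under any parameters of this form, the marginal return on each successive 10x of compute shrinks, which is why smooth power-law scaling cuts against fast-takeoff scenarios driven by raw compute alone.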
Alignment optimism: AIs are white boxes
Let’s zoom in on the alignment feedback loop from the last section. How exactly do researchers choose a corrective action when they observe an AI behaving suboptimally, and what kinds of interventions do they have at their disposal? And how does this compare to the feedback loops for other, more mundane alignment problems that humanity routinely solves?
Human & animal alignment is black box
Compared to AI training, the feedback loop for raising children or training pets is extremely bad. Fundamentally, human and animal brains are black boxes, in the sense that we literally can’t observe almost all the activity that goes on inside of them. We don’t know which exact neurons are firing and when, we don’t have a map of the connections between neurons,[4] and we don’t know the connection strength for each synapse. Our tools for non-invasively measuring the brain, like EEG and fMRI, are limited to very coarse-grained correlates of neuronal firings, like electrical activity and blood flow. Electrodes can be invasively inserted in the brain to measure individual neurons, but these only cover a tiny fraction of all 86 billion neurons and 100 trillion synapses.
If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior.[5] Since we can’t do this, we are forced to resort to crude and error-prone tools for shaping young humans into kind and productive adults. We provide role models for children to imitate, along with rewards and punishments that are tailored to their innate, evolved drives.
It’s striking how well these black box alignment methods work: most people do assimilate the values of their culture pretty well, and most people are reasonably pro-social. But human alignment is also highly imperfect. Lots of people are selfish and anti-social when they can get away with it, and cultural norms do change over time, for better or worse. Black box alignment is unreliable because there is no guarantee that an intervention intended to change behavior in a certain direction will in fact change behavior in that direction. Children often do the exact opposite of what their parents tell them to do, just to be rebellious.
Status quo AI alignment methods are white box
By contrast, AIs implemented using artificial neural networks (ANN) are white boxes in the sense that we have full read-write access to their internals. They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost. And this enables a lot of really powerful alignment methods that just aren’t possible for brains.
The backpropagation algorithm is an important example. Backprop efficiently computes the direction (called the “gradient”) in which to change the synaptic weights of the ANN in order to improve its performance the most, on any criterion we specify. The standard algorithm for training ANNs, called gradient descent, works by running backprop, nudging the weights a small step along the gradient, then running backprop again, and so on for many iterations until performance stops improving. Over the course of training, the weights move step by step out of higher-error regions of the loss landscape and into lower-error regions. Needless to say, we can’t do anything remotely like gradient descent on a human brain, or the brain of any other animal!
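The update rule can be sketched in a few lines. This is a deliberately minimal one-parameter example, not any real training setup: real networks have millions or billions of weights and compute the gradient via backpropagation rather than by hand.

```python
# Toy gradient descent: fit a single weight w so that w * x matches y.
# The gradient of mean squared error with respect to w is derived by hand
# here; backprop automates this for arbitrarily large networks.
def train(xs, ys, w=0.0, lr=0.1, steps=100):
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # nudge the weight a small step down the gradient
    return w

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(train(xs, ys))  # converges toward w = 2.0
```

The key property the essay relies on is visible even here: the criterion being optimized is whatever we specify, and every step moves the weights in the direction that improves it.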
Gradient descent is super powerful because, unlike a black box method, it’s almost impossible to trick. All of the AI’s thoughts are “transparent” to gradient descent and are included in its computation. If the AI is secretly planning to kill you, GD will notice this and almost surely make it less likely to do that in the future. This is because GD has a strong tendency to favor the simplest solution which performs well, and secret murder plots aren’t actively useful for improving human feedback on your actions.
White box alignment in nature
Almost every organism with a brain has an innate reward system. As the organism learns and grows, its reward system directly updates its neural circuitry to reinforce certain behaviors and penalize others. Since the reward system directly updates it in a targeted way using simple learning rules, it can be viewed as a crude form of white box alignment. This biological evidence indicates that white box methods are very strong tools for shaping the inner motivations of intelligent systems. Our reward circuitry reliably imprints a set of motivational invariants into the psychology of every human: we have empathy for friends and acquaintances, we have parental instincts, we want revenge when others harm us, etc. Furthermore, these invariants must be produced by easy-to-trick reward signals that are simple enough to encode in the genome.
This suggests that at least human-level general AI could be aligned using similarly simple reward functions. But we already align cutting edge models with learned reward functions that are much too sophisticated to fit inside the human genome, so we may be one step ahead of our own reward system on this issue.[6] Crucially, I’m not saying humans are “aligned to evolution”— see Evolution provides no evidence for the sharp left turn for a debunking of that analogy. Rather, I’m saying we’re aligned to the values our reward system predictably produces in our environment.
An anthropologist looking at humans 100,000 years ago would not have said humans are aligned to evolution, or to making as many babies as possible. They would have said we have some fairly universal tendencies, like empathy, parenting instinct, and revenge. They might have predicted these values will persist across time and cultural change, because they’re produced by ingrained biological reward systems. And they would have been right.
When it comes to AIs, we are the innate reward system. And it’s not hard to predict what values will be produced by our reward signals: they’re the obvious values, the ones an anthropologist or psychologist would say the AI seems to be displaying during training. For more discussion see Humans provide an untapped wealth of evidence about alignment.
Realistic AI pauses would be counterproductive
When weighing the pros and cons of AI pause advocacy, we must sharply distinguish the ideal pause policy— the one we’d magically impose on the world if we could— from the most realistic pause policy, the one that actually existing governments are most likely to implement if our advocacy ends up bearing fruit.
Realistic pauses are not international
An ideal pause policy would be international— a binding treaty signed by all governments on Earth that have some potential for developing powerful AI. If major players are left out, the “pause” would not really be a pause at all, since AI capabilities would keep advancing. And the list of potential major players is quite long, since the pause itself would create incentives for non-pause governments to actively promote their own AI R&D.
However, it’s highly unlikely that we could achieve international consensus around imposing an AI pause, primarily due to arms race dynamics: each individual country stands to reap enormous economic and military benefits if they refuse to sign the agreement, or sign it while covertly continuing AI research. While alignment pessimists may argue that it is in the self-interest of every country to pause and improve safety, we’re unlikely to persuade every government that alignment is as difficult as pessimists think it is. Such international persuasion is even less plausible if we assume short, 3-10 year timelines. Public sentiment about AI varies widely across countries, and notably, China is among the most optimistic.
The existing international ban on chemical weapons does not lend plausibility to the idea of a global pause. AGI will be, almost by definition, the most useful invention ever created. The military advantage conferred by autonomous weapons will certainly dwarf that of chemical weapons, and they will likely be more powerful even than nukes due to their versatility and precision. The race to AGI will therefore be an arms race in the literal sense, and we should expect it will play out similarly to the last such race: major powers rushed to make a nuclear weapon as fast as possible.
If in spite of all this, we somehow manage to establish a global AI moratorium, I think we should be quite worried that the global government needed to enforce such a ban would greatly increase the risk of permanent tyranny, itself an existential catastrophe. I don’t have time to discuss the issue here, but I recommend reading Matthew Barnett’s “The possibility of an indefinite AI pause” and Quintin Pope’s “AI is centralizing by default; let’s not make it worse,” both submissions to this debate. In what follows, I’ll assume that the pause is not international, and that AI capabilities would continue to improve in non-pause countries at a steady but somewhat reduced pace.
Realistic pauses don’t include hardware
Artificial intelligence capabilities are a function of both hardware (fast GPUs and custom AI chips) and software (good training algorithms and ANN architectures). Yet most proposals for AI pause (e.g. the FLI letter and PauseAI[7]) do not include a ban on new hardware research and development, focusing only on the software side. Hardware R&D is politically much harder to pause because hardware has many uses: GPUs are widely used in consumer electronics and in a wide variety of commercial and scientific applications.
But failing to pause hardware R&D creates a serious problem because, even if we pause the software side of AI capabilities, existing models will continue to get more powerful as hardware improves. Language models are much stronger when they’re allowed to “brainstorm” many ideas, compare them, and check their own work— see the Graph of Thoughts paper for a recent example. Better hardware makes these compute-heavy inference techniques cheaper and more effective.
Hardware overhang is likely
If we don’t include hardware R&D in the pause, the price-performance of GPUs will continue to double every 2.5 years, as it did between 2006 and 2021. This means AI systems will get at least 16x faster after ten years and 256x faster after twenty years, simply due to better hardware. If the pause is lifted all at once, these hardware improvements would immediately become available for training more powerful models more cheaply— a hardware overhang. This would cause a rapid and fairly discontinuous increase in AI capabilities, potentially leading to a fast takeoff scenario and all of the risks it entails.
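The overhang arithmetic above is just compound doubling, and can be checked directly. The 2.5-year doubling time is the historical trend cited in the text; how long it continues is of course an assumption.

```python
# Back-of-the-envelope hardware overhang: price-performance doubling
# every 2.5 years, compounded over the length of a software-only pause.
def speedup(years, doubling_time=2.5):
    return 2 ** (years / doubling_time)

print(speedup(10))  # 16.0  (four doublings)
print(speedup(20))  # 256.0 (eight doublings)
```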
The size of the overhang depends on how fast the pause is lifted. Presumably an ideal pause policy would be lifted gradually over a fairly long period of time. But a phase-out can’t fully solve the problem: legally-available hardware for AI training would still improve faster than it would have “naturally,” in the counterfactual where we didn’t do the pause. And do we really think we’re going to get a carefully crafted phase-out schedule? There are many reasons for thinking the phase-out would be rapid or haphazard (see below).
More generally, AI pause proposals seem very fragile, in the sense that they aren’t robust to mistakes in the implementation or the vagaries of real-world politics. If the pause isn’t implemented perfectly, it seems likely to cause a significant hardware overhang which would increase catastrophic AI risk to a greater extent than the extra alignment research during the pause would reduce risk.
Likely consequences of a realistic pause
If we succeed in lobbying one or more Western countries to impose an AI pause, this would have several predictable negative effects:
Illegal AI labs develop inside pause countries, remotely using training hardware outsourced to non-pause countries to evade detection. Illegal labs would presumably put much less emphasis on safety than legal ones.
There is a brain drain of the least safety-conscious AI researchers to labs headquartered in non-pause countries. Because of remote work, they wouldn’t necessarily need to leave the comfort of their Western home.
Non-pause governments make opportunistic moves to encourage AI investment and R&D, in an attempt to leap ahead of pause countries while they have a chance. Again, these countries would be less safety-conscious than pause countries.
Safety research becomes subject to government approval to assess its potential capabilities externalities. This slows down progress in safety substantially, just as the FDA slows down medical research.
Legal labs exploit loopholes in the definition of a “frontier” model. Many projects are allowed on a technicality; e.g. they have fewer parameters than GPT-4, but use them more efficiently. This distorts the research landscape in hard-to-predict ways.
It becomes harder and harder to enforce the pause as time passes, since training hardware is increasingly cheap and miniaturized.
Whether, when, and how to lift the pause becomes a highly politicized culture war issue, almost totally divorced from the actual state of safety research. The public does not understand the key arguments on either side.
Relations between pause and non-pause countries are generally hostile. If domestic support for the pause is strong, there will be a temptation to wage war against non-pause countries before their research advances too far:
“If intelligence says that a country outside the agreement is building a GPU cluster, be less scared of a shooting conflict between nations than of the moratorium being violated; be willing to destroy a rogue datacenter by airstrike.” — Eliezer Yudkowsky
There is intense conflict among pause countries about when the pause should be lifted, which may also lead to violent conflict.
AI progress in non-pause countries sets a deadline after which the pause must end, if it is to have its desired effect.[8] As non-pause countries start to catch up, political pressure mounts to lift the pause as soon as possible. This makes it hard to lift the pause gradually, increasing the risk of dangerous fast takeoff scenarios (see below).
Predicting the future is hard, and at least some aspects of the above picture are likely wrong. That said, I hope you’ll agree that my predictions are plausible, and are grounded in how humans and governments have behaved historically. When I imagine a future where the US and many of its allies impose an AI pause, I feel more afraid and see more ways that things could go horribly wrong than in futures where there is no such pause.
This post is part of AI Pause Debate Week. Please see this sequence for other posts in the debate.
- ^
Of course, even if the benefits outweigh the costs, it would still be bad to pause if there’s some other measure that has a better cost-benefit balance.
- ^
In brief, the book mostly assumed we will manually program a set of values into an AGI, and argued that since human values are complex, our value specification will likely be wrong, and will cause a catastrophe when optimized by a superintelligence. But most researchers now recognize that this argument is not applicable to modern ML systems which learn values, along with everything else, from vast amounts of human-generated data.
- ^
Some argue that power law scaling is a mere artifact of our units of measurement for capabilities and computing power, which can’t go negative, and therefore can’t be related by a linear function. But non-negativity doesn’t uniquely identify power laws. Conceivably the error rate could have turned out to decay exponentially, like a radioactive isotope, which would be much faster than power law scaling.
- ^
Called a “connectome.” This was only recently achieved for the fruit fly brain.
- ^
Brain-inspired artificial neural networks already exist, and we have algorithms for optimizing them. They tend to be harder to optimize than normal ANNs due to their non-differentiable components.
- ^
On the other hand, we might be roughly on-par with our own reward system insofar as it does within-lifetime learning to figure out what to reward. This is sort of analogous to the learned reward model in reinforcement learning from human feedback.
- ^
To its credit, the PauseAI proposal does recognize that hardware restrictions may be needed eventually, but does not include it in its main proposal. It also doesn’t talk about restricting hardware research and development, which is the specific thing I’m talking about here.
- ^
This does depend a bit on whether safety research in pause countries is openly shared or not, and on how likely non-pause actors are to use this research in their own models.
There’s a giant straw man in this post, and I think it’s entirely unreasonable to ignore. It’s the assertion, or assumption, that the “pause” would be a temporary measure imposed by some countries, as opposed to a stop-gap solution and regulation imposed to enable stronger international regulation, which Nora says she supports. (I’m primarily frustrated by this because it ignores the other two essays, which Nora had access to a week ago, that spelled this out in detail.)
I don’t understand the distinction you’re trying to make between these two things. They really seem like the same thing to me, because a stop-gap measure is temporary by definition:
If by “stronger international regulation” you mean “global AI pause,” I argue explicitly that such a global pause is highly unlikely to happen. You don’t get to assume that your proposed “stop-gap” pause will in fact lead to a global pause just because you called it a stop-gap. What if it doesn’t? Will it be worse than no pause at all in that scenario? That’s a big part of what we’re debating. Is it a “straw man” if I just disagree with you about the likely effects of the policies you’re proposing?
I’m also against a global pause even if we can make it happen, and I say so in the post:
First, it sounds like you are agreeing with others, including myself, about a pause.
So yes, you’re arguing against a straw-man. (Edit to add: Perhaps Rob Bensinger’s views are more compatible with the claim that someone is advocating a temporary pause as a good idea—but he has said that ideally he wants a full stop, not a pause at all.)
Second, you’re ignoring half of what stop-gap means, in order to say it just means pausing, without following up. But it doesn’t.
I laid out in pretty extensive detail what I meant as the steps that need to be in place now, and none of them are a pause; immediate moves by national governments to monitor compute and clarify that laws apply to AI systems, and that they will be enforced, and commitments to build an international regulatory regime.
And the alternative to what you and I agree would be an infeasible pause, you claim, is a sudden totalitarian world government. This is the scary false alternative raised by the other essays as well, and it seems disingenuous to claim that we’d suddenly emerge into a global dictatorship, by assumption. It’s exactly parallel to arguments raised against anti-nuclear proliferation plans. But we’ve seen how that worked out—nuclear weapons were mostly well contained, and we still don’t have a 1984-like global government. So it’s strange to me that you think this is a reasonable argument, unless you’re using it as a scare tactic.
In my essay I don’t make an assumption that the pause would immediate, because I did read your essay and I saw that you were proposing that we’d need some time to prepare and get multiple countries on board.
I don’t see how a delay before a pause changes anything. I still think it’s highly unlikely you’re going to get sufficient international backing for the pause, so you will either end up doing a pause with an insufficiently large coalition, or you’ll back down and do no pause at all.
Is your opposition to stopping the building of dangerously large models via international regulation because you don’t think that it’s possible to do, or because you are opposed to having such limits?
You seem to equivocate; first you say that we need larger models in order to do alignment research, and a number of people have already pointed out that this claim is suspect—but it implies you think any slowdown would be bad even if done effectively. Next, you say that a fast takeoff is more likely if we stop temporarily and then remove all limits, and I agree, but pointed out that no-one is advocating that, and that it’s not opposition to any of the actual proposals, it’s opposition to a straw man. Finally, you say that it’s likely to push work to places that aren’t part of the pause. That’s assuming international arms control of what you agree could be an existential risk is fundamentally impossible, and I think that’s false—but you haven’t argued the point, just assumed that it will be ineffective.
(Also, reread my piece—I call for action to regulate and stop larger and more dangerous models immediately as a prelude to a global moratorium. I didn’t say “wait a while, then impose a pause for a while in a few places.”)
Clarifying question: is a nuclear arms pause or moratorium possible, by your definition of the word? Is it likely?
With the evidence that many world leaders—including the leaders of the USA, Israel, China, and Russia—speak of AI as a must-have strategic technology, do you think they are likely, in plausible future timelines, to reverse course and support international AI pauses before evidence of the dangers of AGI (from humans building one) exists?
Do you dispute that they have said this publicly and recently?
Do you believe there is any empirical evidence proving an AGI is an existential risk available to policymakers? If there is, what is the evidence? Where is the benchmark of model performance showing this behavior?
I am aware many experts are concerned but this is not the same as having empirical evidence to support their concerns. There is an epistemic difference.
I am wondering if we are somehow reading two different sets of news. I acknowledge that it is possible that an AI pause is the best thing humanity could do right now to ensure further existence. But I am not seeing any sign that it is a possible outcome. (By “possible” I mean it’s conceivable for all parties to inexplicably act against their own interests without evidence—but it’s not actually going to happen.)
Edit: it’s possible for Saudi Arabia to read the news on climate change and decide it will produce 0 barrels in 10 years. It’s possible for every OPEC member to agree to the same pledge. It’s possible, with a wartime level of effort, to transition the economy to no longer need OPEC petroleum worldwide, in just 10 years.
But this is not actually possible. The probability of this happening is approximately 0.
Is a nuclear arms moratorium or de-escalation possible? You say it is not, but evidently you’re not aware of the history. The exact thing you just said is not possible has repeatedly worked (NPT, SALT, START), and that base rate tells me all I need to know about whether your estimates are reasonable.
You’re misusing the word empirical. Using your terminology, there’s no empirical evidence that the sun will rise tomorrow, just validated historical trends of positions of celestial objects and claims that fundamental physical laws hold even in the future. I don’t know what to tell you; I agree that there is a lack of clarity, but there is even less empirical evidence that AGI is safe than that it is not.
World leaders have said it’s a vital tool, and also that it’s an existential risk. You’re ignoring the fact that many said the latter.
OPEC is a cartel, and it works to actually restrict output—despite the way that countries have individual incentives to produce more.
To qualify: this would be a moratorium or pause on nuclear arms before powerful nations had doomsday-sized arsenals. The powerful making it expensive for poor nations to get nukes—though several did anyway—is different. And notably, I wonder how well it would have gone if the powerful nations had no nukes of their own. Trying to ban AGI for others—when the others have nukes and their own chip fabs—would be the same situation. Not only will you fail; you will eventually, if you don’t build your own AGI, lose everything. Same if you have no nukes.
What data is that? A model misunderstanding “rules” on an edge case isn’t misaligned, especially when double generation usually works. The sun rising has every prior sunrise as a prior. What empirical data would let someone conclude AGI is an existential risk justifying international agreements? Some measurement or numbers.
Yes, and they said this about nukes and built thousands
Yes to maximize profit. Pledging to go to zero is not the same thing.
You seem to dismiss the claim that AI is an existential risk. If that’s correct, perhaps we should start closer to the beginning, rather than debating global response, and ask you to explain why you disagree with such a large consensus of experts that this risk exists.
I don’t disagree. I don’t see how it’s different than nuclear weapons. Many many experts are also saying this.
Nobody denies nuclear weapons are an existential risk. And every control around their use is just probability-based; there is absolutely nothing stopping a number of routes from ending the richest civilizations. Multiple individuals appear to have the power to do it at any given time, and every form of interlock and safety mechanism has a method of failure or bypass.
Survival to this point was just probability. Over an infinite timescale the nukes will fly.
Point is that it was completely and totally intractable to stop the powerful from getting nukes. SALT was the powerful tiring of paying the maintenance bills and wanting to save money on MAD. And key smaller countries—Ukraine and Taiwan—have strategic reasons to regret their choice to give up their nuclear arms. It is possible that if the choice happens again, future smaller countries will choose to ignore the consequences and build nuclear arsenals. (Ukraine’s first opportunity will be when this war ends: they can start producing plutonium. Taiwan’s chance is when China begins construction of the landing ships.)
So you’re debating something that isn’t going to happen without a series of extremely improbable events happening simultaneously.
If you start thinking about practical interlocks around AI systems, you end up with principles similar to what protects nukes, albeit with some differences: low-level controllers running simple software having authority, air gaps—there are some similarities.
Also, unlike nukes, a single AI escaping doesn’t end the world. It has to escape, and there must be an environment that supports its plans. It is possible for humans to prepare for this and to make the environment inhospitable to rogue AGIs: heavy use of air gaps, formally proven software, careful monitoring and tracking of high-end compute hardware, and a certain minimum amount of human supervision for robots working on large-scale tasks.
This is much more feasible than “put the genie away” which is what a pause is demanding.
You are arguing impossibilities despite a reference class with reasonably close analogues that happened. If you can honestly tell me people thought the NPT was plausible when proposed, I’ll listen when you say this is implausible.
In fact, there is appetite for fairly strong reactions, and if we’re the ones who are concerned about the risks, folding before we even get to the table isn’t a good way to get anything done.
I am saying the common facts that we both have access to do not support your point of view. It never happened. There are no cases of “very powerful, short term useful, profitable or military technologies” that were effectively banned in the last 150 years.
You have to go back to the 1240s to find a reference class match.
These strongly worded statements I just made are trivial for you to disprove. Find a counterexample. I am quite confident and will bet up to $1000 you cannot.
You’ve made some strong points, but I think they go too far.
The world banned CFCs, which were critical for a huge range of applications. It was short term useful, profitable technology, and it had to be replaced entirely with a different and more expensive alternative.
The world has banned human cloning, via a UN declaration, despite the promise of such work for both scientific and medical usage.
Neither of these is exactly what you’re thinking of, and I think both technically qualify under the description you provided, if you wanted to ask a third party to judge whether they match. (Don’t feel any pressure to do so—this is the kind of bet that is unresolvable because it’s not precise enough to make everyone happy about any resolution.)
However, I also think that what we’re looking to do in ensuring only robustly safe AI systems via a moratorium on untested and by-default-unsafe systems is less ambitious or devastating to applications than a full ban on the technology, which is what your current analogy requires. Of course, the “very powerful, short term useful, profitable or military technolog[y]” of AI is only those things if it’s actually safe—otherwise it’s not any of those things, it’s just a complex form of Russian roulette on a civilizational scale. On the other hand, if anyone builds safe and economically beneficial AGI, I’m all for it—but the bar for proving safety is higher than anything anyone currently suggests is feasible, and until that changes, safe strong AI is a pipe-dream.
??? David, do you have any experience with
(1) engineering
(2) embedded safety compliant systems
(3) AI
Note that Mobileye has a very strong proposal for autonomous car safety; I mention it because it’s one of the theoretically best ones.
You can go watch their videos on it, but it’s simple: 3 parallel solvers, each using a completely different input (camera, lidar, imaging radar). If any solver perceives a collision, the system acts to prevent that collision. So failing to avoid a collidable object requires all three solvers to fail at once—pFail^3. That is unlikely; most of the failures are going to be where the system is coupled together.
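The redundancy arithmetic in the comment above can be made concrete with a minimal sketch. The numbers below are illustrative, not Mobileye’s actual figures, and the key assumption (worth flagging) is that the three solvers fail independently, which only holds if the sensors share no common failure modes:

```python
def combined_failure(p_fail: float, n_solvers: int = 3) -> float:
    """Probability that all n independent solvers miss the same collision.

    Assumes independence across solvers, which only holds if the sensors
    (camera, lidar, imaging radar) share no common failure modes.
    """
    return p_fail ** n_solvers

# Even a mediocre per-solver miss rate gives a very reliable ensemble:
p = 0.01  # hypothetical per-solver probability of missing a collision
print(combined_failure(p))  # ~1e-06 under the independence assumption
```

As the comment notes, the residual risk in a design like this tends to concentrate in the coupling (shared code paths, the arbitration logic), exactly where the independence assumption breaks.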
Similar techniques scale to superintelligent AI.
You can go play with it right now, even write your own python script and do it yourself.
Suppose you want an LLM to obey an arbitrary list of “rules”.
You have the LLM generate output, and you measure how often, in testing and in production, it has violated the rules.
Say pFail is 0.1. Then you add another stage: have the LLM check its own output for a rule violation, and don’t send it to the user if a violation was there.
Say the pFail on that stage is 0.2.
Therefore, if the two stages fail independently, the overall system will fail 2% of the time (0.1 × 0.2 = 0.02).
Maybe that’s good enough for current uses, but not good enough to run a steel mill. A robot making an error 2% of the time will probably break itself and cost more service-worker time than having a human operator would.
So you add stages. You create training environments where you model the steel mill, you add more stages of error checking, you do things until empirically your design failure meets spec.
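The staged generate-and-check scheme described above can be sketched in a few lines, using the comment’s illustrative numbers. The function name and structure are mine, and the optimistic assumption baked in is that each checker stage misses violations independently of the generator; a shared blind spot would make the true end-to-end rate worse:

```python
def pipeline_failure_rate(p_violation: float,
                          checker_miss_rates: list[float]) -> float:
    """Probability a rule violation is generated AND missed by every checker.

    Assumes the generator and each checker fail independently; correlated
    failures (e.g. a shared blind spot) would raise the real rate.
    """
    p = p_violation
    for miss in checker_miss_rates:
        p *= miss
    return p

# The comment's numbers: generator violates rules 10% of the time, a single
# self-check stage misses 20% of violations -> 2% end-to-end failure.
print(pipeline_failure_rate(0.1, [0.2]))       # ~0.02

# "Adding stages" until the measured failure rate meets spec:
print(pipeline_failure_rate(0.1, [0.2, 0.2]))  # ~0.004
```

In practice the per-stage miss rates would be measured empirically, as the comment says, rather than assumed.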
This is standard engineering practice. No “AI alignment experts” needed, any ‘real’ engineer knows this.
One of the critical things you do is make your test environment reflect reality. There are a lot of things involved in this, but the one crucial to AI is immutable model weights. When you are validating the model and when it’s used in the real world, it’s immutable: no learning, no going out of control.
And another aspect is controlling state buildup. Most software systems that have ever failed—see the Patriot missile, see Therac-25—failed because state accumulated at runtime. You can prevent this; fresh prompts when using GPT-4 are one obvious way. Limiting what information the model has to operate on reduces how often it fails, both in production and in testing.
A superintelligent system is easily restricted the same way. Because while it may be far past human ability, we tested it in ways we could verify, we check the distribution of the inputs to make sure they were reflected in the test environment—that is, the real world input could have been generated in test—and it’s superintelligent because it generated the right answer almost every time, well below the error rate of a human.
I think the cognitive error here is that everyone is imagining an “ASI” or “AGI” as “like you or me but waaaay smarter”. And this baggage brings in a bunch of elements humans have that an AI system does not need to do its job: mostly memory for an inner monologue or persistent chain of thought, continuity of existence, online learning, and long-term goals.
You need 0 of those to automate most jobs or make complex decisions that humans cannot make accurately.
I disagree with a lot of particulars here, but don’t want to engage beyond this response because your post feels like it’s not about the substantive topic any more, it’s just trying to mock an assumed / claimed lack of understanding on my part. (Which would be far worse to have done if it were correct.)
That said, if you want to know more about my background, I’m eminently google-able, and while you clearly have more background in embedded safety compliant systems, I think you’re wrong on almost all of the details of what you wrote as it applies to AGI.
Regarding your analogy to Mobileye’s approach, I’ve certainly read the papers, and had long conversations with people at Mobileye about their safety systems. That’s why I think it’s fair to say that you’re fundamentally mischaracterizing the idea of “Responsibility-Sensitive Safety”—it’s not about collision avoidance per se, it’s about not being responsible for accidents, in ways that greatly reduce the probability of such accidents. This is critical for understanding what it does and does not guarantee. More critically, for AI systems, this class of safety guarantee doesn’t work, because you need a complete domain model as well as a complete failure-mode model in order to implement a similar failsafe. Back in 2018 I even wrote about how RSS could be extended—which explains why it’s not applicable to AGI—but found that many of my ideas were anticipated by Christiano’s 2016 work (of which that post is one small part), and had been further refined in the context of AGI since then.
So I described scaling stateless microservices to control AGI. This is how current models work, this is how CAIS works, and this is how tech-company stacks work.
I mentioned an in-distribution detector as a filter, and empirical measurement of system safety.
I have mentioned this to safety researchers at OpenAI. The one I talked to on the EleutherAI Discord didn’t know of a flaw.
Why won’t this work? It’s theoretically very strong, simple, and close to current techniques. Can you name or link one actual objection? Eliezer was unable to do so.
The only objection I have heard is “humans will be tempted to build unsafe systems”. Maybe so, but the unsafe ones will have measurably lower performance than this design, for a reason that I will assume you know. So humans will only build a few, and if those cannot escape the lab because the model needs thousands of current-gen accelerator cards to think, then....
If your action space is small enough that what you want the system not to be able to do can be described programmatically in terms of its outputs, and your threat model is complete, it works fine.
Ok in my initial reply I missed something.
In your words, what kinds of tasks do you believe you cannot accomplish with restricted models like I described?
When you say the “threat model has to be complete”, what did you have in mind specifically?
These are restricted models, they get a prompt from an authorized user + context in human parsable format, they emit a human parsable output. This scales from very large to very small tasks, so long as the task can be checked for correctness, ideally in simulation.
With this context, what are your concerns? Why must we be frightened enough to pause everything?
For individual tasks, sure, you can implement verifiers, though I think it becomes quickly unwieldy, but there’s no in-principle reason we cannot do this. But you cannot create AGI with a restricted model—we cannot define the space of what outputs we want, otherwise it’s by definition a narrow AI.
What’s GPT-4?
Because it can generate outputs that are sometimes correct on new tasks—“write me a program that computes X”—it’s general, even if “compute X” is made of 2 common subcomponents the model saw many times in training.
GPT-4 is perfectly safe if you were to run it on local hardware with a local terminal. The “space of outputs” is “text to the terminal”. As long as you don’t leave a security vulnerability where that text stream can cause commands to execute on the host PC, that’s it; that’s all it can do.
Consider that “a robot tethered to a mount” could do general tasks the same way. Same idea: it’s a general system, but its command stream can’t reach anything but the tethered robot, because that’s where the wires go.
You also verified the commands empirically. It’s not that you know any given robotic action or text output is good; it’s that you benchmarked the model and it has a certain pFail on training inputs.
I agree this is not as much generality as humans have. It’s not a narrow AI, though; the “in-distribution detector”—a measure of how similar the current task and input are to the training set—essentially narrows your AI system from a general one to a narrow one, depending on your tolerances.
For tasks where you can’t shut the system down when the input state leaves distribution—say a robotic surgeon, where you need it to keep trying as best it can—you would use electromechanical interlocks, same as the interlocks that have prevented exposure to radiation for 50 years. You tether the surgical robotic equipment, restrict its network links, etc., so that the number of people it can kill is at most 1 (the patient).
Right. And this scales to fairly massive tasks. “Design a medium body airliner to these specs” is completely doable. Or an entire chip in one step.
The model doesn’t get to collaborate with future versions of itself because it doesn’t know who is checking the end product for security vulnerabilities and to be caught is death. We could give a model thousands of tasks of this complexity level and check for sabotage before giving the model 1 real task. It gets no context differentiating the real task from the test tasks, and has no memory of any of the prior steps.
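The “thousands of indistinguishable test tasks before one real task” argument above has a simple probabilistic core. This is my toy formalization, not the commenter’s: assume the model cannot tell the single real task apart from the test tasks, so its only strategy is to sabotage each task independently with some probability, and assume any sabotage on a test task is detected.

```python
def undetected_sabotage(p_defect: float, n_test: int) -> float:
    """P(model sabotages the real task while passing every test task).

    Toy model: the real task is indistinguishable from n_test test tasks,
    sabotage on any test task is caught, and the model defects on each
    task independently with probability p_defect.
    """
    return p_defect * (1 - p_defect) ** n_test

# With a thousand indistinguishable test tasks, any defection rate high
# enough to matter is overwhelmingly likely to be caught in testing first:
print(undetected_sabotage(0.01, 1000))  # roughly 4e-07
```

The sketch only captures the indistinguishability argument; a model that could reliably tell test from production would break the independence assumption entirely.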
And see it scales up and down the tree. The robots building the aircraft don’t get to plan their sabotage by similar limits and so on and so forth.
Your plan to deal with bad models is to use your restricted models to manufacture the weapons needed to fight them, and to optimize their engagements.
This, I think, is a grounded and realistic view of how to win this. Asking for pauses is not.
You’re misinterpreting what a moratorium would involve. I think you should read my post, where I outlined what I think a reasonable pathway would be—not stopping completely forever, but a negotiated agreement about how to restrict more powerful and by-default dangerous systems, and therefore only allowing those that are shown to be safe.
Edit to add: “unlike nukes a single AI escaping doesn’t end the world” ← Disagree on both fronts. A single nuclear weapon won’t destroy the world, while a single misaligned and malign superintelligent AI, if created and let loose, almost certainly will—it doesn’t need a hospitable environment.
So there is one model that might have worked for nukes. You know about PALs and the weak-link/strong-link design methodology? This is a technology for reducing the rogue use of nuclear warheads. It was shared with Russia/the USSR so that they could choose to make their nuclear warheads safe from unauthorized use.
Major AI labs could design software frameworks and tooling that make AI models, even ASI capabilities level models, less likely to escape or misbehave. And release the tooling.
It would be voluntary compliance but like the Linux Kernel it might in practice be used by almost everyone.
As for the second point, no. Your argument has a hidden assumption that is not supported by evidence or credible AI scientists.
The evidence is that models that exhibit human-scale abilities need a human-scale (within an order of magnitude) level of compute and memory. The physical hardware racks to support this are enormous and not available outside AI labs. Were we to restrict the retail sale of certain kinds of training accelerator chips, and especially high-bandwidth interconnects, we could limit the places human-level-plus AI could exist to data centers at known addresses.
Your hidden assumption is optimizations, but the problem is that if you consider not just “AGI” but “ASI”, the amount of hardware needed to support superhuman-level cognition probably scales nonlinearly.
If you wanted a model that could find an action with a better expected value than a human-level model’s 90 percent of the time (so the model is 10 times smarter in utility), it probably needs more than 10 times the compute. Utility is probably logarithmic in compute: to find a better action 90 percent of the time, you need to explore a vastly larger possibility space, and you need the compute and memory to do that.
This is probably provable in a theorem but the science isn’t there yet.
If correct, ASI is actually easily contained. Just write down where 10,000+ H100s are located, or find them by IR signature or power consumption. If you suspect a rogue ASI has escaped, that’s where you check.
This is what I mean by controlling the environment. Realtime auditing of AI accelerator clusters—what model is running, who is paying for it, what’s their license number, etc.—would decrease progress very little while making escapes difficult.
If hacking and escapes turn out to be a threat, air gaps and ASIC hardware firewalls to prevent them are the next level of security to add.
The difference is that major labs would not be decelerated at all. There is no pause. They just, in parallel, have to spend a trivial amount of money complying with the registration and logging requirements.
I have now made a clarification at the very top of the post to make it 1000% clear that my opposition is disjunctive, because people repeatedly get confused / misunderstand me on this point.
My opposition is disjunctive!
I both think that if it’s possible to stop the building of dangerously large models via international regulation, that would be bad because of tyranny risk, and I also think that we very likely can’t use international regulation to stop building these things, so that any local pauses are not going to have their intended effects and will have a lot of unintended net-negative effects.
This really sounds like you are committing the fallacy I was worried about earlier on. I just don’t agree that you will actually get the global moratorium. I am fully aware of what your position is.
I think that you’re claiming something much stronger than “we very likely can’t use international regulation to stop building these things”—you’re claiming that international regulation won’t even be useful to reduce risk by changing incentives. And you’ve already agreed that it’s implausible that these efforts would lead to tyranny, you think they will just fail.
But how they fail matters—there’s a huge difference between something like the NPT, which was mostly effective, and something like the Kellogg-Briand Pact of 1928, which was ineffective but led to a huge change, versus… I don’t know, I can’t really think of many examples of treaties or treaty negotiations that backfired, even though most fail to produce exactly what they hoped. (I think there’s a stronger case to make that treaties can prevent the world from getting stronger treaties later, but that’s not what you claim.)
I think that conditional on the efforts working, the chance of tyranny is quite high (ballpark 30-40%). I don’t think they’ll work, but if they do, it seems quite bad.
And since I think x-risk from technical AI alignment failure is in the 1-2% range, the risk of tyranny is the dominant effect of “actually enforced global AI pause” in my EV calculation, followed by the extra fast takeoff risks, and then followed by “maybe we get net positive alignment research.”
Conditional on “the efforts” working is horribly underspecified. A global governance mechanism run by a new extranational body with military powers monitoring and stopping production of GPUs, or a standard treaty with a multi-party inspection regime?
I’m not conditioning on the global governance mechanism—I assign nonzero probability mass to the “standard treaty” thing—but I think in fact you would very likely need global governance, so that is the main causal mechanism through which tyranny happens in my model.
For what it’s worth, the book does discuss value learning as a way of an AI acquiring values—you can see chapter 13 as being basically about this.
I would describe the core argument of the book as the following (going off of my notes of chapter 8, “Is the default outcome doom?”):
It is possible to build AI that’s much smarter than humans.
This process could loop in on itself, leading to takeoff that could be slow or fast.
A superintelligence could gain a decisive strategic advantage and form a singleton.
Due to the orthogonality thesis, this superintelligence would not necessarily be aligned with human interests.
Due to instrumental convergence, an unaligned superintelligence would likely take over the world.
Because of the possibility of a treacherous turn, we cannot reliably check the safety of an AI on a training set.
There are things to complain about in this argument (a lot of “could”s that don’t necessarily cash out to high probabilities), but I don’t think it (or the book) assumes that we will manually program a set of values into an AGI.
Yep I am aware of the value learning section of Chapter 12, which is why I used the “mostly” qualifier. That said he basically imagines something like Stuart Russell’s CIRL, rather than anything like LLMs or imitation learning.
If we treat the Orthogonality Thesis as the crux of the book, I also think the book has aged poorly. In fact it should have been obvious when the book was written that the Thesis is basically a motte-and-bailey where you argue for a super weak claim (any combo of intelligence and goals is logically possible), which is itself dubious IMO but easy to defend, and then pretend like you’ve proven something much stronger, like “intelligence and goals will be empirically uncorrelated in the systems we actually build” or something.
I do not think the orthogonality thesis is a motte-and-bailey. The only evidence I know of that suggests that the goals developed by an ASI trained with something resembling modern methods would by default be picked from a distribution that’s remotely favorable to us is the evidence we have from evolution[1], but I really think that ought to be screened off. The goals developed by various animal species (including humans) as a result of evolution are contingent on specific details of various evolutionary pressures and environmental circumstances, which we know with confidence won’t apply to any AI trained with something resembling modern methods.
Absent a specific reason to believe that we will be sampling from an extremely tiny section of an enormously broad space, why should we believe we will hit the target?
Anticipating the argument that, since we’re doing the training, we can shape the goals of the systems—this would certainly be reason for optimism if we had any idea what goals we would see emerge while training superintelligent systems, and had any way of actively steering those goals to our preferred ends. We don’t have either, right now.
Which, mind you, is still unfavorable; I think the goals of most animal species, were they to be extrapolated outward to superhuman levels of intelligence, would not result in worlds that we would consider very good. Just not nearly as unfavorable as what I think the actual distribution we’re facing is.
What does this even mean? I’m pretty skeptical of the realist attitude toward “goals” that seems to be presupposed in this statement. Goals are just somewhat useful fictions for predicting a system’s behavior in some domains. But I think it’s a leaky abstraction that will lead you astray if you take it too seriously / apply it out of the domain in which it was designed for.
We clearly can steer AI’s behavior really well in the training environment. The question is just whether this generalizes. So it becomes a question of deep learning generalization. I think our current evidence from LLMs strongly suggests they’ll generalize pretty well to unseen domains. And as I said in the essay I don’t think the whole jailbreaking thing is any evidence for pessimism— it’s exactly what you’d expect of aligned human mind uploads in the same situation.
I could make this same argument about capabilities, and be demonstrably wrong. The space of neural network values that don’t produce coherent grammar is unimaginably, ridiculously vast compared to the “tiny target” of ones that do. But this obviously doesn’t mean that ChatGPT is impossible.
The reason is that we aren’t randomly throwing a dart at possibility space, but using a highly efficient search mechanism to rapidly toss out bad designs until we hit the target. But when these machines are trained, we simultaneously select for capabilities and for alignment (murderbots are not efficient translators). For ChatGPT, this leads to an “aligned” machine, at least by some definitions.
Where I think the motte and bailey often occurs is jumping between “aligned enough not to exterminate us”, and “aligned with us nearly perfectly in every way” or “unable to be misused by bad actors”. The former seems like it might happen naturally over development, whereas the latter two seem nigh impossible.
The argument w.r.t. capabilities is disanalogous.
Yes, the training process is running a search where our steering is (sort of) effective for getting capabilities—though note that with e.g. LLMs we have approximately zero ability to reliably translate known inputs [X] into known capabilities [Y].
We are not doing the same thing to select for alignment, because “alignment” is:
an internal representation that depends on multiple unsolved problems in philosophy, decision theory, epistemology, math, etc, rather than “observable external behavior” (which is what we use to evaluate capabilities & steer training)
something that might be inextricably tied to the form of general intelligence which by default puts us in the “dangerous capabilities” regime, or if not strongly bound in theory, then strongly bound in practice
I do think this disagreement is substantially downstream of a disagreement about what “alignment” represents, i.e. I think that you might attempt outer alignment of GPT-4 but not inner alignment, because GPT-4 doesn’t have the internal bits which make inner alignment a relevant concern.
Is this commonly agreed upon, even after fine-tuning with RLHF? I assumed it’s an open empirical question. The way I understand it is that there’s a reward signal (human feedback) that’s shaping different parts of the neural network that determines GPT-4’s outputs, and we don’t have good enough interpretability techniques to know whether some parts of the neural network are representations of “goals”, and even less so what specific goals they are.
I would’ve thought it’s an open question whether even base models have internal representations of “goals”, either always active or only active in some specific context. For example if we buy the simulacra (predictors?) frame, a goal could be active only when a certain simulacrum is active.
(would love to be corrected :D)
I don’t know if it’s commonly agreed upon; that’s just my current belief based on available evidence (to the extent that the claim is even philosophically sound enough to be pointing at a real thing).
Or another rephrase: how is “secretly planning to murder all humans” improving the model’s scores on a benchmark? Think about it: what gradient from the training set even led to this capability of an inner cognitive process looking for a chance to betray? What force is causing this cognitive process to come out of the random initial weights?
Humans seem to have such a force, but that’s because modeling “if I kill this rival then the reward for me is...” was evolutionarily useful. Also, it’s probably a learned behavior.
And second, yeah, SGD should push “neutral” weights inside the network that are not contributing to correct answers towards weights that do increase the odds of a correct output distribution. So it should actively destroy “unnecessary” cognitive processes inside the model.
You could prove this: make a psychopathic model designed to "betray" in a game-like world, then see how many rounds of training on a new dataset clear away the model's ability to kill when killing improves its score.
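As a toy, purely illustrative sketch of the "SGD destroys unnecessary processes" claim above (all numbers made up): plain SGD actually leaves a weight alone if the loss doesn't depend on it, since its gradient is zero—it's explicit regularization like weight decay that erodes unused weights:

```python
# Toy model: prediction = w_used * x. w_unused never touches the output,
# standing in for a circuit that doesn't contribute to the loss.
w_used = 0.0
w_unused = 5.0          # trained with SGD + weight decay
w_unused_plain = 5.0    # trained with plain SGD (no decay)
lr, weight_decay = 0.1, 0.01

for _ in range(1000):
    x, y_target = 1.0, 2.0
    grad_used = 2 * (w_used * x - y_target) * x   # dLoss/dw_used
    # w_unused has zero gradient: the loss doesn't depend on it at all.
    w_used -= lr * grad_used
    w_unused_plain -= lr * 0.0                        # plain SGD: unchanged
    w_unused -= lr * (0.0 + weight_decay * w_unused)  # decay shrinks it

print(round(w_used, 3))   # converges to 2.0
print(w_unused_plain)     # still 5.0: plain SGD never touched it
print(w_unused < 5.0)     # weight decay is what erodes the unused weight
```

So whether an "unnecessary" circuit gets scrubbed depends on the training setup (regularization, dropout, noise), not on SGD alone—which is part of why the proposed betrayal-washout experiment would be informative.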
(I personally don’t find this likely, so this might accidentally be a strawman)
For example: planning and gaining knowledge are incentivized on many benchmarks → instrumental convergence makes model instrumentally value power among other things → a very advanced system that is great at long-term planning might conclude that “murdering all humans” is useful for power or other instrumentally convergent goals
I think with our current interpretability techniques we wouldn't be able to robustly distinguish between a model that generalized to behave well in any reasonable environment and a model that learned to behave well in that specific environment but would turn to betrayal in many other environments.
Please stop saying that mind-space is an “enormously broad space.” What does that even mean? How have you established a measure on mind-space that isn’t totally arbitrary?
What if concepts and values are convergent when trained on similar data, just like we see convergent evolution in biology?
Why don’t you make the positive case for the space of possible (or, if you wish, likely) minds being minds which have values compatible with the fulfillment of human values? I think we have pretty strong evidence that not all minds are like this even within the space of minds produced by evolution.
Concepts do seem to be convergent to some degree (though note that ontological shifts at increasing levels of intelligence seem likely), but I do in fact think that evidence from evolution suggests that values are strongly contingent on the kinds of selection pressures which produced various species.
The positive case is just super obvious, it’s that we’re trying very hard to make these systems aligned, and almost all the data we’re dumping into these systems is generated by humans and is therefore dripping with human values and concepts.
I also think we have strong evidence from ML research that ANN generalization is due to symmetries in the parameter-function map which seem generic enough that they would apply mutatis mutandis to human brains, which also have a singular parameter-function map (see e.g. here).
Not really sure what you’re getting at here/why this is supposed to help your side
What do you mean by this? (Compare: "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model." Is this the same thing?) Is there a good writeup anywhere of why we should expect this to happen? This seems speculative and unlikely to me.
The fact that natural selection produced species with different goals/values/whatever isn’t evidence that that’s the only way to get those values, because “selection pressure” isn’t a mechanistic explanation. You need more info about how values are actually implemented to rule out that a proposed alternative route to natural selection succeeds in reproducing them.
Re: ontological shifts, see this arbital page: https://arbital.com/p/ontology_identification.
I’m not claiming that evolution is the only way to get those values, merely that there’s no reason to expect you’ll get them by default by a totally different mechanism. The fact that we don’t have a good understanding of how values form even in the biological domain is a reason for pessimism, not optimism.
The point I was trying to make is that natural selection isn't a "mechanism" in the right sense at all. It's a causal/historical explanation, not an account of how values are implemented. What is the evidence from evolution? The fact that species with different natural histories end up with different values really doesn't tell us much without a discussion of mechanisms. We need to know 1) how different the mechanisms actually used to point biological and artificial cognitive systems toward ends are, and 2) how many possible mechanisms to do so there are.
One reason for pessimism would be that human value learning has too many messy details. But LLMs are already better behaved than anything in the animal kingdom besides humans and are pretty good at intuitively following instructions, so there is not much evidence for this problem. If you think they are not so brainlike, then this is evidence that not-so-brainlike mechanisms work. And there are also theories that value learning in current AI works roughly similarly to value learning in the brain.
Which is just to say I don’t see the prior for pessimism, just from looking at evolution.
The orthogonality thesis is trivially a motte and bailey—you’re using it as one right here! The original claim by Bostrom was a statement against logical necessity: ‘an artificial mind need not care intrinsically about any of those things’ (emphasis mine); yet in your comment you’re equivocating with a statement that’s effectively about probability: ‘sampling from an extremely tiny section of an enormously broad space’.
You might be right in your claim, but your claim is not what the arguments in the orthogonality thesis papers purport to show.
I would also like to make a stronger counterclaim: I think a priori arguments about ‘probability space’ (dis)prove way too much. If you disregard empirical data, you can use them to disprove anything, like ‘the height of Earth fauna is contingent on specific details of various evolutionary pressures and environmental circumstances, and is sampled from a tiny section on the number line, so we should expect that alien fauna we encounter will be arbitrarily tall (or perhaps have negative height)’. If Earth-evolved intelligence tends even weakly to have e.g. sympathy towards non-kin, that is evidence that Earth-evolved intelligence is a biased sample, but also evidence that there exists some pull towards non-kin-sympathy in intelligence space.
My sense is that (as your footnote hints at), the more intelligent animals are, the more examples we seem to see of individual non-reciprocal altruism to non-kin (there are many clear examples of non-reciprocal altruism across species in cetaceans for e.g., and less numerous but still convincing examples of it in corvids).
As a side note the actual things that break this loop are
(1) we don’t use superintelligent singletons and probably won’t, I hope. We instead create context limited model instances of a larger model and tell it only about our task and the model doesn’t retain information. This “break an ASI into a billion instances each which lives only in the moment” is a powerful alignment method
(2) it seems to take an absolutely immense amount of compute hardware to host even today's models, which are significantly below human intelligence in some expensive-to-fix ways. (For example, how many H100s would you need for useful realtime video perception?)
This means a “rogue” Singleton would have nowhere to exist, as it would be too heavy in weights and required bandwidth to run on a botnet.
This breaks everything else.
It’s telling that Bostroms PhD is in philosophy and I don’t see any industry experience on his wiki page. He is correct if you ignore real world limitations on AI.
FYI, current cutting-edge large language models are trained on a massive amount of text on the internet (in the case of GPT-4, likely approximately all the text OpenAI could get their hands on). So they certainly have tons of information about stuff other than the task at hand.
This is not what that statement means.
What it means is that the model has no context of its history since training. It has no context for whether the task it has been given is "real". It does not know whether other copies of itself or other AIs are checking its outputs for correctness, with serious consequences if it sabotages the output. It doesn't know it's not still in training. It doesn't know whether there are a billion instances of it or just one.
We can scrub all this information fairly easily and we already do this as of right now.
We can also make trick output where we try to elicit latent deception by giving information that would tell the model its time to betray.
We can also work backwards and find what the adversarial inputs are: when will the model change its answer for this question?
This essay seems predicated on a few major assumptions that aren't quite spelled out, or at any rate not presented as assumptions.
This assumes that making AI behave nice is genuine progress in alignment. The opposing take is that all it’s doing is making the AI play a nicer character, but doesn’t lead it to internalize its goals, which is what alignment is actually about. And in fact, AI playing rude characters was never the problem to begin with.
You say that alignment is linked to capability in the essay, but this also seems predicated on the above. This kind of “alignment” makes the AI better at figuring out what the humans want, but historically, most thinkers in alignment have always assumed that AI gets good at figuring out what humans want, and that it’s dangerous anyway.
What worries me the most is that the primary reason for this view that’s presented in the essay seems to be a social one (or otherwise, I missed it).
I.e., Miri’s approach to alignment hasn’t worked out, therefore the current work is better. But this argument doesn’t work—but approaches can be failures! I think Eliezer would argue that Miri’s work had a chance of leading to an alignment solution but has failed, whereas current alignment work (like RLHF on LLMs) has no chance of solving alignment.
If this is true, then the core argument of this essay collapses, and I don't see a strong argument here that it's not true. Why should we believe that MIRI is wrong about alignment difficulty? The fact that their approach failed is not strong evidence of this; if they're right, they weren't very likely to succeed in the first place.
And even if they’re completely wrong, that still doesn’t prove that current alignment approaches have a good chance of working.
Another assumption you make is that AGI is close and, in particular, will come out of LLMs. E.g.:
This is a case where you agree with most MIRI staff, but, e.g., Stuart Russell and Steven Byrnes are on record saying that we likely will not get AGI out of LLMs. If this is true, then RLHF done on LLMs is probably even less useful for alignment, and it also means the hard verdict on the arguments in Superintelligence is unwarranted. Things could still play out a lot more like classical AI alignment thinking in the paradigm that actually gives us AGI.
And I’m also not ready to toss out the inner vs. outer paradigm just because there was one post criticizing it.
I think this is a misleading frame which makes alignment seem harder than it actually is. What does it mean to “internalize” a goal? It’s something like, “you’ll keep pursuing the goal in new situations.” In other words, goal-internalization is a generalization problem.
We know a fair bit about how neural nets generalize, although we should study it more (I’m working on a paper on the topic atm). We know they favor “simple” functions, which means something like “low frequency” in the Fourier domain. In any case, I don’t see any reason to think the neural net prior is malign, or particularly biased toward deceptive, misaligned generalization. If anything the simplicity prior seems like good news for alignment.
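As a toy, seeded illustration of this kind of simplicity bias (not the methodology of any particular paper—just the standard parameter-function-map counting trick): sample random weights for a tiny ReLU net, tally which boolean function of its inputs it computes, and check whether simple (constant) functions occupy more than their fair share of parameter space:

```python
import random
from collections import Counter

random.seed(0)
g = random.gauss

# Tiny net: 2 inputs -> 2 ReLU units -> 1 output, thresholded at 0.
# Its behavior on the four boolean inputs defines one of 16 functions.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

def random_function():
    W = [[g(0, 1), g(0, 1)] for _ in range(2)]
    b = [g(0, 1) for _ in range(2)]
    v = [g(0, 1) for _ in range(2)]
    c = g(0, 1)
    bits = []
    for x in inputs:
        h = [max(0.0, W[i][0] * x[0] + W[i][1] * x[1] + b[i]) for i in range(2)]
        out = v[0] * h[0] + v[1] * h[1] + c
        bits.append(1 if out > 0 else 0)
    return tuple(bits)

counts = Counter(random_function() for _ in range(20000))
constant_share = (counts[(0, 0, 0, 0)] + counts[(1, 1, 1, 1)]) / 20000

# The two constant functions are 2/16 = 12.5% of all functions, but in
# practice they grab a much larger share of random parameter settings.
print(counts.most_common(1)[0][0])
print(constant_share)
```

This is of course only a sketch under made-up architecture choices; the serious versions of this argument work with much larger nets and more careful measures of function complexity.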
I think internalizing X means “pursuing X as a terminal goal”, whereas RLHF arguably only makes model pursue X as an instrumental goal (in which case the model would be deceptively aligned). I’m not saying that GPT-4 has a distinction between instrumental and terminal goals, but a future AGI, whether an LLM or not, could have terminal goals that are different from instrumental goals.
You might argue that deceptive alignment is also an obsolete paradigm, but I would again respond that we don’t know this, or at any rate, that the essay doesn’t make the argument.
I don’t think the terminal vs. instrumental goal dichotomy is very helpful, because it shifts the focus away from behavioral stuff we can actually measure (at least in principle). I also don’t think humans exhibit this distinction particularly strongly. I would prefer to talk about generalization, which is much more empirically testable and has a practical meaning.
What if it just is the case that AI will be dangerous for reasons that current systems don’t exhibit, and hence we don’t have empirical data on? If that’s the case, then limiting our concerns to only concepts that can be empirically tested seems like it means setting ourselves up for failure.
I’m not sure what one is supposed to do with a claim that can’t be empirically tested—do we just believe it/act as if it’s true forever? Wouldn’t this simply mean an unlimited pause in AI development (and why does this only apply to AI)?
In principle, we do the same thing as with any claim (whether explicitly or otherwise):
- Estimate the expected value of (directly) testing the claim.
- Test it if and only if (directly) testing it has positive EV.
The point here isn’t that the claim is special, or that AI is special—just that the EV calculation consistently comes out negative (unless someone else is about to do something even more dangerous—hence the need for coordination).
This is unusual and inconvenient. It appears to be the hand we’ve been dealt.
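For concreteness, the decision rule above with entirely hypothetical numbers (none of these probabilities are anyone's actual estimates):

```python
# Illustrative only: every number here is made up for the sake of example.
# EV of directly testing a dangerous-capability claim vs. not testing it.
p_claim_true = 0.1            # prior that the dangerous capability exists
value_of_information = 10.0   # how much knowing the answer improves our decisions
harm_if_test_goes_wrong = 1000.0
p_test_causes_harm = 0.05     # chance the test itself triggers the harm

ev_test = value_of_information - (
    p_claim_true * p_test_causes_harm * harm_if_test_goes_wrong
)
print(ev_test > 0)  # test only if the EV comes out positive
```

The disagreement in this thread is then about the inputs: pause advocates think the harm term dominates for frontier-scale experiments; skeptics think the information term does.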
I think you’re asking the right question: what is one supposed to do with a claim that can’t be empirically tested?
So just to summarize:
No deceptive or dangerous AI has ever been built or empirically tested. (1)
Historically AI capabilities have consistently been “underwhelming”, far below the hype. (2)
If we discuss “ok we build a large AGI, give it persistent memory and online learning, and isolate it in an air gapped data center and hand carry data to the machine via hardware locked media, what is the danger” you are going to respond either with:
“I don’t know how the model escapes but it’s so smart it will find a way” or (3)
“I am confident humanity will exist very far into the future so a small risk now is unacceptable (say 1-10 percent pDoom)”.
and if I point out that this large ASI model needs thousands of H100 accelerator cards and megawatts of power and specialized network topology to exist and there is nowhere to escape to, you will argue “it will optimize itself to fit on consumer PCs and escape to a botnet”. (4)
Have I summarized the arguments?
Like we’re supposed to coordinate an international pause and I see 4 unproven assertions above that have zero direct evidence. The one about humanity existing far into the future I don’t know I don’t want to argue that because it’s not falsifiable.
Shouldn’t we wait for evidence?
Thanks I mean more in terms of “how can we productively resolve our disagreements about this?”, which the EV calculations are downstream of. To be clear, it doesn’t seem to me that this is necessarily the hand we’ve been dealt but I’m not sure how to reduce the uncertainty.
At the risk of sidestepping the question, the obvious move seems to be “try harder to make the claim empirically testable”! For example, in the case of deception, which I think is a central example we could (not claiming these ideas are novel):
Test directly for deception behaviourally and/or mechanistically (I’m aware that people are doing this, think it’s good and wish the results were more broadly shared).
Think about what aspects of deception make it particularly hard, and try to study those in isolation and test those. The most important example seems to me to be precursors: finding more testable analogues to the question of “before we get good, undetectable deception do we get kind of crappy detectable deception?”
Obviously these all run some (imo substantially lower) risks but seem well worth doing. Before we declare the question empirically inaccessible we should at least do these and synthesise the results (for instance, what does grokking say about (2)?).
(I’m spinning this comment out because it’s pretty different in style and seems worth being able to reply to separately. Please let me know if this kind of chain-posting is frowned upon here.)
Another downside to declaring things empirically out of reach and relying on priors for your EV calculations and subsequent actions is that it more-or-less inevitably converts epistemic disagreements into conflict.
If it seems likely to you that this is the way things are (and so we should pause indefinitely) but it seems highly unlikely to me (and so we should not) then we have no choice but to just advocate for different things. There’s not even the prospect of having recourse to better evidence to win over third parties, so the conflict becomes no-holds-barred. I see this right now on Twitter and it makes me very sad. I think we can do better.
(apologies for slowness; I’m not here much)
I’d say it’s more about being willing to update on less direct evidence when the risk of getting more direct evidence is high.
Clearly we should aim to get more evidence. The question is how to best do that safely. At present we seem to be taking the default path—of gathering evidence in about the easiest way, rather than going for something harder, slower and safer. (e.g. all the “we need to work with frontier models” stuff; I do expect that’s most efficient on the empirical side; I don’t expect it’s a safe approach)
I think a lot depends on whether we’re:
Aiming to demonstrate that deception can happen.
Aiming to robustly avoid deception.
For demonstration, we can certainly do useful empirical stuff—ARC Evals already did the lying-to-a-TaskRabbit-worker demonstration (clearly this isn't anything like deceptive alignment, but it is deception [given suitable scaffolding]).
I think that other demonstrations of this kind will be useful in the short term.
For avoiding all forms of deception, I’m much more pessimistic—since this requires us to have no blind-spots, and to address the problem in a fundamentally general way. (personally I doubt there’s a [general solution to all kinds of deception] without some pretty general alignment solution—though I may be wrong)
I’m sure we’ll come up with solutions to particular types of / definitions of deception in particular contexts. This doesn’t necessarily tell us much about other types of deception in other contexts. (for example, this kind of thing—but not only this kind of thing)
I’d also note that “reducing the uncertainty” is only progress when we’re correct. The problem that kills us isn’t uncertainty, but overconfidence. (though granted it might be someone else’s overconfidence)
You need to have some motivation for thinking that a fundamentally new kind of danger will emerge in future systems, in such a way that we won’t be able to handle it as it arises. Otherwise anyone can come up with any nonsense they like.
If you’re talking about e.g. Evan Hubinger’s arguments for deceptive alignment, I think those arguments are very bad, in light of 1) the white box argument I give in this post, 2) the incoherence of Evan’s notion of “mechanistic optimization,” and 3) his reliance on “counting arguments” where you’re supposed to assume that the “inner goals” of the AI are sampled “uniformly at random” from some uninformative prior over goals (I don’t think the LLM / deep learning prior is uninformative in this sense at all).
That was what everyone in AI safety was discussing for a decade or more, until around 2018. You seem to ignore these arguments about why AI will be dangerous, as well as all of the arguments that alignment will be hard. Are you familiar with all of that work?
The ideological Turing test seems like a case where the distinction can be seen clearly in humans: the instrumental goal is to persuade the other that you sincerely hold certain beliefs/values (which imply goals), while your terminal goal is to advance advocacy of your different actual beliefs.
I definitely disagree with this—especially the last sentence; essentially all of my hope for neural net inductive biases comes from them not being like an actual simplicity prior. The primary literature I’d reference here would be “How likely is deceptive alignment?” for the practical question regarding concrete neural net inductive biases and “The Solomonoff Prior is Malign” for the purely theoretical question concerning the actual simplicity prior.
So, I definitely don’t have the Solomonoff prior in mind when I talk about simplicity. I’m actively doing research at the moment to better characterize the sense in which neural nets are biased toward “simple” functions, but I would be shocked if it has anything to do with Kolmogorov complexity.
Okay, my crux is that the simplicity/Kolmogorov/Solomonoff prior is probably not very malign, assuming we could run it, and in general I find the prior not to be malign except for specific situations.
This is basically because it relies on the IMO dubious assumption that the halting oracle can only be used once; notably, once we use the halting/Solomonoff oracle more than once, the Solomonoff oracle loses its malign properties.
More generally, if the Solomonoff Oracle is duplicatable, as modern AIs generally are, then there’s a known solution to mitigate the malignancy of the Solomonoff prior: Duplicate it, and let multiple people run the Solomonoff inductor in parallel to increase the complexity of manipulation. The goal is essentially to remove the uniqueness of 1 Solomonoff inductor, and make an arbitrary number of such oracles to drive up the complexity of manipulation.
So under a weak assumption, the malignancy of the Solomonoff prior goes away. This is described well in the link below, and the important part is that we need either a use-once condition, or we need to assume uniqueness in some way. If we don’t have either assumption holding, as is likely to be the case, then the Solomonoff/Kolmogorov prior isn’t malign.
https://www.lesswrong.com/posts/f7qcAS4DMKsMoxTmK/the-solomonoff-prior-is-malign-it-s-not-a-big-deal#Comparison_
And that’s if it’s actually malign, which it might not be, at least in the large-data limit:
https://www.lesswrong.com/posts/Tr7tAyt5zZpdTwTQK/the-solomonoff-prior-is-malign#fDEmEHEx5EuET4FBF
More specifically, it’s this part of John Wentworth’s comment:
As far as the actual practical question goes, there is a very important limitation imposed on inner-misaligned agents by SGD, primarily because gradient hacking is very difficult to do. This is an underappreciated limitation on misalignment, since SGD has powerful tools to remove inner-misaligned circuits/TMs/agents, as described in the link below:
https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult
On the last part of your comment—if AGI doesn’t come out of LLMs then what would the justification for a pause be?
That progress is incredibly fast, and new architectures explicitly aimed at creating AGI are getting proposed and implemented. (I’m agnostic about whether LLMs will scale past human reasoning—it seems very plausible they won’t. But I don’t think it matters, because that’s not the only research direction with tons of resources being put into it that create existential risks.)
Interesting—what do you have in mind for fast-progressing architectures explicitly aimed at creating AGI?
On your 2nd point on x-risks from non-LLM AI, am I right in thinking that you would also hope to catch dual-use scientific AI (for instance) in a compute governance scheme and/or pause? That’s a considerably broader remit than I’ve seen advocates of a pause/compute restrictions argue for and seems much harder to achieve both politically and technically.
If regulators or model review firms have any flexibility (which seems very plausible), and the danger of AGI is recognized (which seems increasingly likely), then once there is any recognition of promising progress towards AGI, review of the models for safety would occur—as it should, as in any other engineering discipline, albeit in this case more like civil engineering, where lives are on the line, than software engineering, where they usually aren't.
And considering other risks, as I argued in my piece, there’s an existing requirement for countries to ban bioweapons development, again, as there should be. I’m simply proposing that countries should fulfill that obligation, in this case, by requiring review of potentially dangerous research into ML which can be applied to certain classes of virology.
I feel like I detect a missing mood from you where you’re skeptical of pausing (for plausible-to-me reasons), but you’re not conflicted about it like I am and you don’t e.g. look for ways to buy time or ways for regulation to help without the downsides of a pause. (Sorry if this sounds adversarial.) Relatedly, this post is one-sided and so feels soldier-mindset-y. Likely this is just due to the focus on debating AI pause. But I would feel reassured if you said you’re sympathetic to: labs not publishing capabilities research, labs not publishing model weights, dangerous-capability-model-eval-based regulation, US and allies slowing other states and denying them compute, and/or other ways to slow AI or for regulation to help. If you’re unsympathetic to these, I would doubt that the overhang nuances you discuss are your true rejection (but I’d be interested in hearing more about your take on slowing and regulation outside of “pause”).
Edit: man, I wrote this after writing four object-level comments but this got voted to the top. Please note that I’m mostly engaging on the object level and I think object-level discussion is generally more productive—and I think Nora’s post makes several good points.
I think people have started to stretch the “missing mood” concept a bit too far for my taste.
What actual mood is missing here?
If you think that the default path of AI development leads towards eventual x-risk safety, but that rash actions like an AI pause could plausibly push us off that path and into catastrophe, then your default moods would be “fervent desire to dissuade people from doing the potentially disastrous thing”, and “happy that the disastrous thing probably won’t happen”. I think this matches with the mood the OP has provided.
I worry that these sort of meta-critiques can inadvertently be used to pressure people into one side of object-level disagreements. This isn’t a dig at you in particular, and I acknowledge that you made object level points as well, which really should be higher than this comment.
Noticing the irony that this very natural AI safety idea is (in Nora’s view) actually counterproductive and so constructively searching for ways to modify it and for adjacent ideas that don’t have its downsides
Sympathy with the pro-pause position and its proponents
Also feeling more conflicted in general—there are several real considerations in favor of pausing and Nora doesn’t grapple with them. (But this is a debate and maybe Nora is deliberately one-sidedly arguing for a particular position.)
Maybe “missing mood” isn’t exactly the right concept.
So the point of the "missing mood" concept was that it was an indicator of motivated reasoning. If someone reports to you that "Lithuanians are genetically bad at chess" with a mood of unrestrained glee, you can rightly get suspicious of their methods. If they weren't already prejudiced against Lithuanians, they would find the result about chess ability sad and unfortunate.
I see no similar indicators here. From Nora's perspective, the AI pause and similar proposals are a bomb that will hurl us much closer to catastrophe. Why (from their perspective) would there be a requirement to show sympathy for the bomb-throwers, or to propose a modified bomb design?
Now of course, as a human being Nora will have pre-existing biases towards one side or the other, and you can pick apart the piece if you want to find evidence of that (like the use of the phrase "heavy handed government regulation"). But having some bias towards one side doesn't mean your arguments are wrong. The meta can have some uses if it's truly blatant, but it's the object level that actually matters.
If you desperately wish we had more time to work on alignment, but also think a pause won’t make that happen or would have larger countervailing costs, then that would lead to an attitude like: “If only we had more time! But alas, a pause would only make things worse. Let’s talk about other ideas…” For my part, I definitely say things like that (see here).
However, Nora has sections claiming “alignment is doing pretty well” and “alignment optimism”, so I think it’s self-consistent for her to not express that kind of mood.
Insofar as Nora discusses nuances of overhang, it would be odd and annoying for that to not actually be cruxy for her (given that she doesn’t say something like this isn’t cruxy for me).
I was reading it as a kinda disjunctive argument. If Nora says that a pause is bad because of A and B, either of which is sufficient on its own from her perspective, then you could say “A isn’t cruxy for her” (because B is sufficient) or you could say “B isn’t cruxy for her” (because A is sufficient). Really, neither of those claims is accurate.
Oh well, whatever, I agree with you that the OP could have been clearer.
Yep it’s all meant to be disjunctive and yep it could have been clearer. FWIW this essay went through multiple major revisions and at one point I was trying to make the disjunctivity of it super clear but then that got de-prioritized relative to other stuff. In the future if/when I write about this I think I’ll be able to organize things significantly better
Hmm, "AI safety is probably easy" implies that slowing AI is lower-stakes, but it doesn't obviously imply much about whether slowing AI is net-positive. It's not obvious to me what alignment optimism has to do with the pause debate, and I don't think you discuss this.
Sorry, I thought it would be fairly obvious how it’s related. If you’re optimistic about alignment then the expected benefits you might hope to get out of a pause (whether or not you actually do get those benefits) are commensurately smaller, so the unintended consequences should have more relative weight in your EV calculation.
To be clear, I think slowing down AI in general, as opposed to the moratorium proposal in particular, is a more reasonable position that’s a bit harder to argue against. I do still think the overhang concerns apply in non-pause slowdowns but in a less acute manner.
Given alignment optimism, the benefits of pause are smaller—but the unintended consequences for alignment are smaller too. I guess alignment optimism suggests pause-is-bad if e.g. your alignment optimism is super conditional on smooth progress...
Could you say more about what you see as the practical distinction between a “slow down AI in general” proposal vs. a “pause” proposal?
Where we agree:
“dangerous-capability-model-eval-based regulation” sounds good to me. I’m also in favor of Robin Hanson’s foom liability proposal. These seem like very targeted measures that would plausibly reduce the tail risk of existential catastrophe, and don’t have many negative side effects. I’m also not opposed to the US trying to slow down other states, although it’d depend on the specifics of the proposal.
Where we (partially) disagree:
I think there’s a plausible case to be made that publishing model weights reduces foom risk by making AI capabilities more broadly distributed, and also enhances security-by-transparency. Of course there are concerns about misuse— I do think that’s a real thing to be worried about— but I also think it’s generally exaggerated. I also relatively strongly favor open source on purely normative grounds. So my inclination is to be in favor of it but with reservations. Same goes for labs publishing capabilities research.
I feel like you’re trying to round these three things into a “yay versus boo” axis, and then come down on the side of “boo”. I think we can try to do better than that.
One can make certain general claims about learning algorithms that are true and for which evolution provides as good an example as any. One can also make other claims that are true for evolution and false for other learning algorithms, and then we can argue about which category future AGI will be in. I think we should be open to that kind of dialog, and it involves talking about evolution.
Likewise, I think “inner misalignment versus outer misalignment” is a helpful and valid way to classify certain failure modes of certain AI algorithms.
For the third one, there’s an argument like:
“Maybe the AI will really want something-or-other to happen in the future, and try to make it happen, including by long-term planning—y’know, the way some humans really want to break out of prison, or the way Elon Musk really wants to go to Mars. Maybe the AIs have other desires and do other things too, but that’s not too relevant to what I’m saying. Next, there are a lot of reasons to think that “AIs that really want something-or-other to happen in the future” will show up sooner or later, e.g. the fact that smart people have been trying to build them since the dawn of AI and continuing through today. And if we get such AIs, and they’re very smart and competent, it has similar relevant consequences as “rigid utility maximizing consequentialists”—particularly power-seeking / instrumental convergence, and not pursuing plans that have obvious and effective countermeasures.”
Do you buy that argument? If so, I think some discussions of “rigid utility maximizing consequentialists” can be useful. I also think that some such discussions can lead to conclusions that do not necessarily transfer to more realistic AGIs (see here). So again, I think we should avoid yay-versus-boo thinking.
I think that part of the blog post you linked was being facetious. IIUC they had some undisclosed research program involving Haskell for a few years, and then they partly but not entirely wound it down when it wasn’t going as well as they had hoped. But they have also been doing other things too the whole time, like their agent foundations team. (I have no personal knowledge beyond reading the newsletters etc.)
For example, FWIW, I have personally found MIRI employee Abram Demski’s blog posts (including pre-2020) to be very helpful to my thinking about AGI alignment.
Anyway, your more general claim in this section seems to be: Given current levels of capabilities, there is no more alignment research to be done. We’re tapped out. The well is dry. The only possible thing left to do is twiddle our thumbs and wait for more capable models to come out.
Is that really your belief? Do you look at literally everything on alignmentforum etc. as total garbage? Obviously I have a COI but I happen to think there is lots of alignment work yet to do that would be helpful and does not need newly-advanced capabilities to happen.
Nothing in this comment should be construed as “all things considered we should be for or against the pause”—as it happens I’m weakly against the pause too—these are narrower points. :)
I certainly give relatively little weight to most conceptual AI research. That said, I respect that it’s valuable for you and am open to trying to narrow the gap between our views here—I’m just not sure how!
To be more concrete, I’d value 1 year of current progress over 10 years of pre-2018 research (to pick a date relatively arbitrarily). I don’t intend this as an attack on the earlier alignment community, I just think we’re making empirical progress in a way that was pretty much impossible before we had good models available to study and I place a lot more value on this.
I have a vague impression—I forget from where and it may well be false—that Nora has read some of my AI alignment research, and that she thinks of it as not entirely pointless. If so, then when I say “pre-2020 MIRI (esp. Abram & Eliezer) deserve some share of the credit for my thinking”, then that’s meaningful, because there is in fact some nonzero credit to be given. Conversely, if you (or anyone) don’t know anything about my AI alignment research, or think it’s dumb, then you should ignore that part of my comment, it’s not offering any evidence, it would just be saying that useless research can sometimes lead to further useless research, which is obvious! :)
I probably think less of current “empirical” research than you, because I don’t think AGI will look and act and be built just like today’s LLMs but better / larger. I expect highly-alignment-relevant differences between here and there, including (among other things) reinforcement learning being involved in a much more central way than it is today (i.e. RLHF fine-tuning). This is a big topic where I think reasonable people disagree and maybe this comment section isn’t a great place to hash it out. ¯\_(ツ)_/¯
My own research doesn’t involve LLMs and could have been done in 2017, but I’m not sure I would call it “purely conceptual”—it involves a lot of stuff like scrutinizing data tables in experimental neuroscience papers. The ELK research project led by Paul Christiano also could have been done in 2017, as far as I can tell, but lots of people seem to think it’s worthwhile; do you? (Paul is a coinventor of RLHF.)
I’ve certainly heard of your work but it’s far enough out of my research interests that I’ve never taken a particularly strong interest. Writing this in this context makes me realise I might have made a bit of a one-man echo chamber for myself… Do you mind if we leave this as ‘undecided’ for a while?
Regarding ELK—I think the core of the problem as I understand it is fairly clear once you begin thinking about interpretability. Understanding the relation between AI and human ontologies was part of the motivation behind my work on AlphaZero (as well as an interest in the natural abstractions hypothesis). Section 4 “Encoding of human conceptual knowledge” and Section 8 “Exploring activations with unsupervised methods” are the places to look. The section on challenges and limitations in concept probing echoes, I think, a lot of the concerns in ELK.
In terms of subsequent work on ELK, I don’t think much of the work on solving ELK was particularly useful, and it often reinvented existing methods (e.g. sparse probing, causal interchange interventions). If I were to try and work on it, I think the best way to do so would be to embed the core challenge in a tractable research program, for instance trying to extract new scientific knowledge from ML models like AlphaFold.
To move this in a more positive direction, the most fruitful/exciting conceptual work I’ve seen is probably (1) the natural abstractions hypothesis and (2) debate. When I think a bit about why I particularly like these, for (1) it’s because it seems plausibly true, extremely useful if true, and amenable to both formal theoretical work and empirical study. For (2) it’s because it’s a pretty striking new idea that seems very powerful/scalable, but also can be put into practice a bit ahead of really powerful systems.
It’s perhaps also worth separating the claims that A) previous alignment research was significantly less helpful than today’s research and B) the reason that was the case continues to hold today.
I think I’d agree with some version of A, but strongly disagree with B.
The reason that A seems probably true to me is that we didn’t know the basic paradigm in which AGI would arise, and so previous research was forced to wander in the dark. You might also believe that today’s focus on empirical research is better than yesterday’s focus on theoretical research (I don’t necessarily agree) or at least that theoretical research without empirical feedback is on thin ice (I agree).
I think most people now think that deep learning, perhaps with some modifications, will be what leads to AGI—some even think that LLM-like systems will be sufficient. And the shift from primarily theoretical research to primarily empirical research has already happened. So what will cause today’s research to be worse than future research with more capable models? You can appeal to a general principle of “unknown unknowns,” but if you genuinely believe that deep learning (or LLMs) will eventually be used in future AGI, it seems hard to believe that knowledge won’t transfer at all.
Steven, the issue is that without empirical data you end up with a branching tree of possible futures. And if you make some faulty assumptions early—such as assuming the amount of compute needed to host optimal AI models is small and easily stolen via hacking—you end up lost in a tree of possibilities where every one you consider is “doom”. And thus you arrive at the conclusion of “pDoom is 99 percent”, because you are only cognitively able to consider adjacent futures in the possibility tree. No living human can keep track of thousands of possibilities in parallel. This is where I think Eliezer and Zvi are lost: they simply ignore branches that would lead to different outcomes.
(And vice versa, you could arrive at the opposite conclusion).
It becomes angels dancing on the head of a pin. There is no way to make a policy decision based on this. You need to prove your beliefs with data. It’s how we even got here as a species.
One of the three major threads in this post (I think) is noticing pause downsides: in reality, an “AI pause” would have various predictable downsides.
Part of this is your central overhang concerns, which I discuss in another comment. The rest is:
(I have some relevant ideas in Cruxes on US lead for some domestic AI regulation and Cruxes for overhang.)
My high-level take: suppose for illustration that “powerful AI” is binary and powerful AI would appear by default (i.e. with no pause) in 2030 via a 1e30 FLOP training run. (GPT-4 used about 2e25 FLOP.) Several of these concerns would apply to a 1e23 FLOP ceiling but not to a 1e28 FLOP ceiling—a 1e28 ceiling would delay powerful AI (which would instead be reached via inference-time algorithmic progress and compute increases) but would likely not become evadable, let other countries surpass the US, etc. I mostly agree with Nora that low-ceiling pauses are misguided—but the upshot of that for me is not “pause is bad” but “pauses should have a high ceiling.”
Unfortunately, it’s pretty uncertain when powerful AI would appear by default, and even if you know the optimal threshold for regulation you can’t automatically cause that to occur. But some policy regimes would be more robust to mistakes than “aim 2–3 OOMs below when powerful AI would appear and pause there”—e.g. starting around 1e26 today and doubling every year.
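To make that last schedule concrete, here’s a minimal sketch; the 1e26 FLOP starting ceiling, the yearly doubling, and the 1e30 FLOP “powerful AI” threshold are all illustrative assumptions from this comment, not forecasts:

```python
# Illustrative numbers only: the ceiling starts at 1e26 FLOP today and
# doubles every year; suppose (hypothetically) that "powerful AI"
# requires a 1e30 FLOP training run.
ceiling = 1e26
target = 1e30

years = 0
while ceiling < target:
    ceiling *= 2
    years += 1

print(years)  # → 14: the ceiling wouldn't permit a 1e30 FLOP run for ~14 years
```

So even a mistakenly low starting point self-corrects eventually, and mis-estimating the threshold by an order of magnitude only shifts the date by a few years.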
Specific takes:
Yeah, policy regimes should have enforcement to prevent evasion. This is a force pushing toward higher ceilings. Maybe you think evasion is inevitable? I don’t, at least for reasonably high ceilings, although I don’t know much about it.
Idk, depends on the details of the policy. I think some experts think US regulation on training runs would largely apply beyond US borders (extraterritoriality); my impression is US can disallow foreign companies from using US nationals’ labor, at least.
I agree that if US pauses and loses its lead that’s a big downside. I don’t think that’s inevitable, although it is a force pushing toward less ambitious pauses / higher ceilings. See Cruxes on US lead for some domestic AI regulation.
Hmm, I don’t see how most pause-proposals I’ve heard of would require government approval for safety research. The exception is proposals that would require government approval for fine-tuning frontier LLMs. Is that right? Would it be fine if the regulation only hit base-model training runs (or fine-tuning with absurd amounts of compute, like >1e23 FLOP)?
My impression is that pause regulation would probably use the metric training compute, which is hard to game.
Yeah :( so you have to raise the ceiling over time, or set it sufficiently high that you get powerful AI from inference-time progress [algorithmic progress + hardware progress + increased spending] before the policy becomes unenforceable. Go for a less-ambitious pause like 1e27 training FLOP or something.
This isn’t directly bad, but it raises the chance that the pause is suddenly reversed, which is directly bad. This is a major risk from a pause, I think.
To some extent I think it would be good for a liberal pause alliance to impose its will on defectors. To some extent the US / the liberal pause alliance should set the ceiling sufficiently high that they can pause without losing their lead and should attempt to slow defectors e.g. via export controls.
Maybe sorta. Violence sounds implausible.
Partially agree. So (a) go for a less-ambitious pause like 1e27 training FLOP or something, (b) try to slow other countries, and (c) note they partially slow when the US slows because US progress largely causes foreign progress (via publishing research and sharing models).
Super cruxy for the effects of various possible pauses is how long it would take e.g. China to catch up if US/UK/Europe paused, and how much US/UK/Europe could slow China. I really wish we knew this.
Just something that jumped out at me. Suppose a pause is on 1e28+ training runs.
The human brain is made of modules organized in a way we don’t understand. But we do know that the frontal lobes, associated with executive functions, are a small part of the total tissue.
This means an AI system could be a collection of a few dozen specialized 1e28 models separated by API calls, hosted in a common data center for low-latency interconnects.
If a “few dozen” is 100+ modules, the total compute used would be 1e30, and it might be possible to make this system an AGI by using difficult training tasks to drive this level of cognitive development through feedback.
This is especially true with “meta” system architectures, where new modules could be automatically added to improve scores on tasks where deficiencies are present and training the existing weights leads to regressions.
Interesting—something to watch out for! Perhaps it could be caught by limiting the number of training runs any individual actor can do that are close to / at the FLOP limit (to 1/year?). Of course then actors intent on it could try and use a maze of shell companies or something, but that could be addressed by requiring complete financial records and audits.
Sure. In practice there’s the national sovereignty angle, though. This just devolves into each party “complying” with the agreement while violating it in various ways. There’s too much incentive to defect.
The US government just never audits its secret national labs, China just never checks anything, Israel just openly decides they can’t afford to comply at all, etc. Everyone claims to be in compliance.
Really depends on how much of a taboo develops around AGI. If it’s driven underground it becomes much less likely to happen given the resources required.
My thought on this: I think of flamethrowers and gas shells and the worst WW1 battlefields. I am not sure what taboo humans won’t violate in order to win.
This isn’t war though. What are some peace-time examples of taboo violations (especially state-sanctioned ones)? I can only really think of North Korea and a handful of other pariah states (none of which would be capable of developing AGI).
This can be avoided with a treaty that requires full access given to international inspectors. This already happens with the IAEA and was set up even in the far greater tensions of the cold war. If someone like Iran tries to kick out the inspectors, everyone assumes they’re trying to develop nuclear weapons and takes serious action (harsh sanctions, airstrikes, even the threat of war).
If governments think of this as an existential threat, they should agree to it for the same reasons they did with the IAEA. And while there are big incentives to defect (unless they have a very high p(doom)), there is also the knowledge that kicking out inspectors will lead to potential war and their rivals defecting too.
If this turns out to be feasible, one solution would be to have people on-site (or make TSMC put hardware level controls in place) to randomly sample from the training data several times a day to verify outside data isn’t involved in the training run.
Random aside, but I think this paragraph is unjustified both in its core argument (that the referenced theory-first efforts propagated actively misleading ways of thinking about alignment) and in its citations, none of which provide the claimed support.
The first post (re: evolutionary analogy as evidence for a sharp left turn) sees substantial pushback in the comments, and that pushback seems more correct to me than not; in any case, the post seems to misunderstand the position it’s arguing against.
The second post presents an interesting case for a set of claims that are different from “there is no distinction between inner and outer alignment”; I do not consider it to be a full refutation of that conceptual distinction. (See also Steven Byrnes’ comment.)
The third post is at best playing games with the definitions of words (or misunderstanding the thing it’s arguing against), at worst is just straightforwardly wrong.
I have less context on the fourth post, but from a quick skim of both the post and the comments, I think the way it’s most relevant here is as a demonstration of how important it is to be careful and precise with one’s claims. (The post is not making an argument about whether AIs will be “rigid utility maximizing consequentialists”, it is making a variety of arguments about whether coherence theorems necessarily require that whatever ASI we might build will behave in a goal-directed way. Relatedly, Rohin’s comment a year after writing that post indicated that he thinks we’re likely to develop goal-directed agents; he just doesn’t think that’s entailed by arguments from coherence theorems, which may or may not have been made by e.g. Eliezer in other essays.)
My guess is that you did not include the fifth post as a smoke test to see if anyone was checking your citations, but I am having trouble coming up with a charitable explanation for its inclusion in support of your argument.
I’m not really sure what my takeaway is here, except that I didn’t go scouring the essay for mistakes—the citation of Quintin’s post was just the first thing that jumped out at me, since that wasn’t all that long ago. I think the claims made in the paragraph are basically unsupported by the evidence, and the evidence itself is substantially mischaracterized. Based on other comments it looks like this is true of a bunch of other substantial claims and arguments in the post:
that Bostrom’s core argument has aged poorly
CIRL being widely considered irrelevant
whether proposed pauses are intended to be temporary[1]
Though I’m sort of confused about what this back-and-forth is talking about, since it’s referencing behind-the-scenes stuff that I’m not privy to.
I agree that alignment research would suffer during a pause, but I’ve been wondering recently how much of an issue that is. The key point is that capabilities research would also be paused, so it’s not like AI capabilities would be racing ahead of our knowledge on how to control ever more powerful systems. You’d simply be delaying both capabilities and alignment progress.
You might then ask—what’s the point of a pause if alignment research stops? Isn’t the whole point of a pause to figure out alignment?
I’m not sure that’s the whole point of a pause. A pause can also give us time to figure out optimal governance structures, whether standards, regulations, etc. These structures can be very important in reducing x-risk. Even if the U.S. is the only country to pause, that still gives us more time, because the U.S. is currently in the lead.
I realise you make other points against a pause (which I think might be valid), but I would welcome thoughts on the ‘having more time for governance’ point specifically.
Thanks very much for writing this very interesting piece!
The “AI safety winter” section argues that pre-2020, AI alignment researchers made little progress because they had no AI to work on aligning. But now that we have GPT-4 etc., I feel like we have a capabilities overhang, and it seems like there is plenty for AI alignment researchers to work on for the next 6 months or so? Then their work could be ‘tested’ by allowing some more algorithmic progress.
This post has definitely made me more pessimistic on a pause, particularly:
• If we pause, it’s not clear how much extra time we get at the end and how much this costs us in terms of crunch time.
• The implementation details are tricky and actors are incentivised to try to work around the limitations.
On the other hand, I disagree with the following:
• That it is clear that alignment is doing well. There are different possible difficulty levels that alignment could have. I agree that we are in an easier world, where ChatGPT has already achieved a greater amount of outer alignment than we would have expected from some of the old arguments about the impossibility of listing all of our implicit conditions. On the other hand, it’s not at all clear that we’re anywhere near close to scalable alignment techniques, so there’s a pretty decent argument that we’re far behind where we need to be.
• Labelling AIs as white box merely because we can see all of the weights. You’ve got a point. I can see where you’re coming from. However, I’m worried that your framing is confusing and will cause people to talk past each other.
• That if there was a pause, alignment research would magically revert back to what it was back in the MIRI days. Admittedly, this is more implied than literally stated, but if we take it literally then it’s absurd. There’s no shortage of empirical experiments for people to run at the current capability level.
• A large part of the reason why alignment progress was so limited during the last “pause” was that only a very few people were working on it. They certainly made mistakes, but I don’t think you’re fully appreciating the value of the conceptual framework that we inherited from them and how that’s informed the empirical work.
The claim is more like, “the MIRI days are a cautionary tale about what may happen when alignment research isn’t embedded inside a feedback loop with capabilities.” I don’t literally believe we would revert back to pure theoretical research during a pause, but I do think the research would get considerably lower quality.
Perhaps, but I think the current conventional wisdom that neural nets are “black box” is itself a confusing and bad framing and I’m trying to displace it.
AI safety currently seems to lean heavily towards empirical work, and this emphasis only seems to be growing, so I’m rather skeptical that a bit more theoretical work on the margin will be some kind of catastrophe. I’d actually expect it to be a net positive.
There are probably 100s of AI Alignment / Interpretability PhD theses that could be done on GPT-4 alone. That’s 5 years of empirical work right there without any further advances in capabilities.
Any serious Pause would be indefinite, and only lifted when there is global consensus on an alignment solution that provides sufficient x-safety. I think a lot of objections to Pause are based on the idea that it would have a fixed time limit. This is obviously unrealistic—when has there ever been an international treaty or moratorium that had a fixed expiry date?
One of the three major threads in this post (I think) is feedback loops & takeoff: for safety, causing capabilities to increase more gradually and have more time with more capable systems is important, relative to total time until powerful systems appear. By default, capabilities would increase gradually. A pause would create an “overhang” and would not be sustained forever; when the pause ends, the overhang entails that capabilities increase rapidly.
I kinda agree. I seem to think rapid increase in training compute is less likely, would be smaller, and would be less bad than you do. Some of the larger cruxes:
Magnitude of overhang: it seems the size of the largest training run largely isn’t about the cost of compute. Why hasn’t someone done a billion-dollar LLM training run, why did we only recently break $10M? I don’t know but I’d guess you can’t effectively (i.e. you get sharply diminishing returns for doing more than a couple orders of magnitude more than models that have been around for a while), or it’s hard to get a big cluster to parallelize and so the training run would take years, or something.
Magnitude of overhang: endogeneity. AI progress improves AI progress, for reasons like Examples of AI Improving AI and normal iterating and learning from experience. This means takeoff is faster than otherwise, especially in no-pause worlds. So a pause makes fast takeoff worse but not as much as we’d naively think.
Badness of overhang: I seem to think total-time is more important relative to time-with-powerful-models than you, such that I’d accept a small overhang in exchange for a moderate amount of timeline. Shrug. This is probably because (a) I’m more pessimistic about alignment than you and (b) I’m more optimistic about current alignment research being useful for aligning powerful AI. Probably it’s not worth
I discuss my cruxes in Cruxes for overhang (also relevant: Cruxes on US lead for some domestic AI regulation).
This is too strong. Some pauses could be sustained through AGI, obviating the overhang problem. For example, if you pause slightly below AGI, you get to AGI via algorithmic improvements and inference-time compute increases—the pause doesn’t end and overhang isn’t an issue.
One of the three major threads in this post (I think) is alignment optimism: AI safety probably isn’t super hard.
A possible implication is that a pause is unnecessary. But the difficulty of alignment doesn’t seem to imply much about whether slowing is good or bad, or about its priority relative to other goals.
(I disagree that gradient descent entails “we are the innate reward system” and thus safe, or that “full read-write access to [AI systems’] internals” gives safety in the absence of great interpretability. I think likely failure modes include AI playing the training game, influence-seeking behavior dominating, misalignment during capabilities generalization, and catastrophic Goodharting, and that AGI Ruin: A List of Lethalities is largely right. But I think in this debate we should focus on determining optimal behavior as a function of the difficulty of alignment, rather than having intractable arguments about the difficulty of alignment.)
Yes. This one seems critical, and I don’t understand it at all.
At the extremes, if alignment-to-”good”-values by default was 100% likely I presume slowing down would be net-negative, and racing ahead would look great. It’s unclear to me where the tipping point is, what kind of distribution over different alignment difficulty levels one would need to have to tip from wanting to speed up vs wanting to slow down AI progress.
Seems to me like the more longtermist one is, the more slowing down looks good even when one is very optimistic about alignment. Then again there are some considerations that push against this: risk of totalitarianism, risk of pause that never ends, risk of value-agnostic alignment being solved and the first AGI being aligned to “worse” values than the default outcome.
(I realize I’m using two different definitions of alignment in this comment, would like to know if there’s standardized terminology to differentiate between them)
The second link just takes me to Alex Turner’s shortform page on LW, where ctrl+f-ing “assistance” doesn’t get me any results. I do find this comment when searching for “CIRL”, which criticizes the CIRL/assistance games research program, but does not claim that it is irrelevant to modern deep learning. For what it’s worth, I think it’s plausible that Alex Turner thinks that assistance games is mostly irrelevant to modern deep learning (and plausible that he doesn’t think that) - I merely object that the link provided doesn’t provide good evidence of that claim.
The first link is to Rohin Shah’s reviews of Human Compatible and some assistance games / CIRL research papers. ctrl+f-ing “deep” gets me two irrelevant results, plus one description of a paper “which is inspired by [the CIRL] paper and does a similar thing with deep RL”. It would be hard to write such a paper if CIRL (aka assistance games) was mostly irrelevant to modern deep learning. The closest thing I can find is in the summary of Human Compatible, which says “You might worry that the proposed solution [of making AI via CIRL / assistance games] is quite challenging: after all, it requires a shift in the entire way we do AI.”. This doesn’t make assistance games irrelevant to modern deep learning—in 2016, it would have been true to say that moving the main thrust of AI research to language modelling so as to produce helpful chatbots required a shift in the entire way we did AI, but research into deeply learned large language models was not irrelevant to deep learning as of 2016 - in fact, it sprung out of 2016-era deep learning.
Yeah, I don’t think it’s accurate to say that I see assistance games as mostly irrelevant to modern deep learning, and I especially don’t think that it makes sense to cite my review of Human Compatible to support that claim.
The one quote that Daniel mentions about shifting the entire way we do AI is a paraphrase of something Stuart says, and is responding to the paradigm of writing down fixed, programmatic reward functions. And in fact, we have now changed that dramatically through the use of RLHF, for which a lot of early work was done at CHAI, so I think this reflects positively on Stuart.
I’ll also note that in addition to the “Learning to Interactively Learn and Assist” paper that does CIRL with deep RL which Daniel cited above, I also wrote a paper with several CHAI colleagues that applied deep RL to solve assistance games.
My position is that you can roughly decompose the overall problem into two subproblems: (1) in theory, what should an AI system do? (2) Given a desire for what the AI system should do, how do we make it do that?
The formalization of assistance games is more about (1), saying that AI systems should behave more like assistants than like autonomous agents (basically the point of my paper linked above). These are mostly independent. Since deep learning is an answer to (2) while assistance games are an answer to (1), you can use deep learning to solve assistance games.
I’d also say that the current form factor of ChatGPT, Claude, Bard etc is very assistance-flavored, which seems like a clear success of prediction at least. On the other hand, it seems unlikely that CHAI’s work on CIRL had much causal impact on this, so in hindsight it looks less useful to have done this research.
All this being said, I view (2) as the more pressing problem for alignment, and so I spend most of my time on that, which implies not working on assistance games as much any more. So I think it’s overall reasonable to take me as mildly against work on assistance games (but not to take me as saying that it is irrelevant to modern deep learning).
I asked Alex “no chance you can comment on whether you think assistance games are mostly irrelevant to modern deep learning?”
His response was “i think it’s mostly irrelevant, yeah, with moderate confidence”. He then told me he’d lost his EA forum credentials and said I should feel free to cross-post his message here.
(For what it’s worth, as people may have guessed, I disagree with him—I think you can totally do CIRL-type stuff with modern deep learning, to the extent you can do anything with modern deep learning.)
Suppose you walk down a street, and unbeknownst to you, you’re walking by a dumpster that has a suitcase full of millions of dollars. There’s a sense in which you “can”, “at essentially no cost”, walk over and take the money. But you don’t know that you should, so you don’t. All the value is in the knowledge.
A trained model is like a computer program with a billion unlabeled parameters and no documentation. Being able to view the code is helpful but doesn’t make it “white box”. Saying it’s “essentially no cost” to “analyze” a trained model is just crazy. I’m pretty sure you have met people doing mechanistic interpretability, right? It’s not trivial. They spend months on their projects. The thing you said is just so crazy that I have to assume I’m misunderstanding you. Can you clarify?
Nora is Head of Interpretability at EleutherAI :)
It’s essentially no cost to run a gradient-based optimizer on a neural network, and I think this is sufficient for good-enough alignment. I view the interpretability work I do at Eleuther as icing on the cake, allowing us to steer models even more effectively than we already can. Yes, it’s not zero cost, but it’s dramatically lower cost than it would be if we had to crack open a skull and do neurosurgery.
Also, if by “mechanistic interpretability” you mean “circuits” I’m honestly pretty pessimistic about the usefulness of that kind of research, and I think the really-useful stuff is lower cost than circuits-based interp.
If you want to say “it’s a black box but the box has a “gradient” output channel in addition to the “next-token-probability-distribution” output channel”, then I have no objection.
If you want to say ”...and those two output channels are sufficient for safe & beneficial AGI”, then you can say that too, although I happen to disagree.
If you want to say “we also have interpretability techniques on top of those, and they work well enough to ensure alignment for both current and future AIs”, then I’m open-minded and interested in details.
If you want to say “we can’t understand how a trained model does what it does in any detail, but if we had to drill into a skull and only measure a few neurons at a time etc. then things sure would be even worse!!”, then yeah duh.
But your OP said “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost”, and used the term “white box”. That’s the part that strikes me as crazy. To be charitable, I don’t think those words are communicating the message that you had intended to communicate.
For example, find a random software engineer on the street, and ask them: “if I give you a 1-terabyte compiled executable binary, and you can do whatever you want with that file on your home computer, would you describe it as closer to “white box” or “black box”?”. I predict most people would say “closer to black box”, even though they can look at all the bits and step through the execution and run decompilation tools etc. if they want. Likewise you can ask them whether it’s possible to “analyze” that binary “at essentially no cost”. I predict most people would say “no”.
Differentiability is a pretty big part of the white box argument.
The terabyte compiled executable binary is still white box in a minimal sense but it’s going to take a lot of work to mould that thing into something that does what you want. You’ll have to decompile it and do a lot of static analysis, and Rice’s theorem gets in the way of the kinds of stuff you can prove about it. The code might be adversarially obfuscated, although literal black box obfuscation is provably impossible.
If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff. And if I’m worried about the base model being preserved underneath and doing nefarious things, I can generate synthetic data from the fine tuned model and train a fresh network from scratch on that (although to be fair that’s pretty compute-intensive).
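To make the fine-tune-then-distill procedure concrete, here is a toy numpy sketch using a tiny logistic-regression "model" purely for illustration (the scale and economics of doing this to a real LLM are of course very different, and all names and numbers here are made up). It fine-tunes base weights toward a new target behaviour via the gradient channel, then distills the tuned model into a fresh network trained from scratch on synthetic labels:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, steps=2000, lr=0.5, w=None):
    """Plain gradient descent on the logistic-regression loss --
    the 'gradient output channel' the white-box argument leans on."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        # exact gradient, available because the model is differentiable
        w -= lr * X.T @ (p - y) / len(y)
    return w

# "Base model": trained to do one thing
X = rng.normal(size=(400, 3))
y_base = (X[:, 0] > 0).astype(float)
w_base = train(X, y_base)

# Fine-tune the same weights toward a different target behaviour
y_target = (X[:, 1] > 0).astype(float)
w_tuned = train(X, y_target, w=w_base.copy())
acc_tuned = np.mean((sigmoid(X @ w_tuned) > 0.5) == y_target)

# Distill: label fresh synthetic inputs with the tuned model, then train
# a brand-new model from scratch on those labels (no base weights carried over)
X_syn = rng.normal(size=(400, 3))
y_syn = (sigmoid(X_syn @ w_tuned) > 0.5).astype(float)
w_fresh = train(X_syn, y_syn)

# The fresh model mimics the tuned model's behaviour on held-out inputs
X_test = rng.normal(size=(400, 3))
agreement = np.mean((sigmoid(X_test @ w_fresh) > 0.5) == (sigmoid(X_test @ w_tuned) > 0.5))
print(acc_tuned, agreement)
```

In this toy case the distilled model matches the tuned model closely; the open question, as noted above, is just how compute-intensive this gets at frontier scale.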
I don’t think “mouldability” is a synonym of “white-boxiness”. In fact, I think they’re hardly related at all:
There can be a black box with lots of knobs on the outside that change the box’s behavior. It’s still a black box.
Conversely, consider an old-fashioned bimetallic strip thermostat with a broken dial. It’s not mouldable at all—it can do one and only one thing, i.e. actuate a switch at a certain fixed temperature. (Well, I guess you can use it as a doorstop!) But a bimetallic strip thermostat is still very white-boxy (after I spend 30 seconds telling you how it works).
You wrote “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.” I feel like I keep pressing you on this, and you keep motte-and-bailey’ing into some other claim that does not align with a common-sense reading of what you originally wrote:
“Well, the cost of analysis could theoretically be even higher—like, if you had to drill into skulls…” OK sure but that’s not the same as “essentially no cost”.
“Well, the cost of analysis may be astronomically high, but there’s a theorem proving that it’s not theoretically impossible…” OK sure but that’s not the same as “essentially no cost”.
“Well, I can list out some specific analysis and manipulation tasks that we can do at essentially no cost: we can do X, and Y, and Z, …” OK sure but that’s not the same as “we can analyze and manipulate however we want at essentially no cost”.
Do you see what I mean?
But this is irrelevant to the original claim, right? Being able to fine-tune might make introspection on its internal algorithmic representations a bit cheaper, but in practice we observe that it takes alignment researchers weeks or months to figure out what extremely tiny slices of two-generations-old LLMs are doing.
Well, a computer model is “literally” transparent in the sense that you can see everything, which means the only difficulty is in understanding what it means. So the part where you spend 5 million dollars on a PET scanner doesn’t exist for ANNs, and in that sense you can analyze them for “free”.
If the understanding part is sufficiently difficult… which it sure seems to be… then this doesn’t really help, but it is a coherent conceptual difference.
Simple and genuine question from a non-AI guy
I understand the arguments towards encouraging gradual development vs. fast takeoff, but I don’t understand this argument I’ve heard multiple times (not just on this post) that “we need capabilities to increase so that we can stay up to date with alignment research”.
First, I thought there’s still a lot of work we could do with current capabilities—technical alignment is surely limited by time, money and manpower, not just by computing power. I’m also guessing less powerful AI could be made during a “pause” specifically for alignment research.
Second, in a theoretical situation where capabilities research globally stopped overnight, isn’t this just free extra time for the human race where we aren’t moving towards doom? That feels pretty valuable and high-EV in and of itself.
It seems to me the argument would have to be that the advantage to the safety work of improving capabilities would outstrip the increasing risk of dangerous GAI, which I find hard to get my head around, but I might be missing something important.
Thanks.
Not responding to your main question:
I’m interpreting this as saying that buying humanity more time, in and of itself, is good.
I don’t think extra time pre-transformative-AI is particularly valuable except for its impact on existential risk. Two reasons why I think this:
Astronomical waste argument. Time post-transformative-AI is way more valuable than time now, assuming some form of aggregative/total utilitarianism (a strong version isn’t necessary). If I were trading clock-time seconds now for seconds a thousand years from now, assuming no difference in existential risk, I would probably be willing to trade every historical second of humans living good lives for like a minute a thousand years from now, because it seems like we could have a ton of (morally relevant) people in the future, and the moral value derived from their experience could be significantly greater than that of current humans.
The moral value of the current world seems plausibly negative due to large amounts of suffering. Factory farming, wild animal suffering, humans experiencing suffering, and more, seem like they make the total sign unclear. Under moral views that weigh suffering more highly than happiness, there’s an even stronger case for the current world being net-negative. This is one of those arguments that I think is pretty weird and almost never affects my actions, but it is relevant to the question of whether extra time for the human race is positive EV.
A third argument is that AI arriving sooner could help reduce other existential risks: e.g., the mundane example of AI speeding up vaccine research, or the weirder example of AI enabling space colonization, since being spread across many planets makes x-risk lower. I don’t personally put very much weight on this argument, but it’s worth mentioning.
Thanks Aaron, appreciate the effort.
I failed to point out my central assumption here: that transformative AI, given our current state of poor preparedness, is net negative due to the existential risk it entails.
It’s a good point that time pre-transformative-AI isn’t so valuable in the grand scheme of the future, but that EV would increase substantially assuming transformative AI is the end.
Still looking for the fleshing out of this argument that I don’t understand—if anyone can be bothered!
”It seems to me the argument would have to be that the advantage to the safety work of improving capabilities would outstrip the increasing risk of dangerous GAI, which I find hard to get my head around, but I might be missing something important.”
What is your p(doom|AGI)? (Assuming AGI is developed in the next decade.)
Note that Bostrom himself says in Astronomical Waste (my emphasis in bold):
I don’t think you read my comment:
I also think it’s bad how you (and a bunch of other people on the internet) ask this p(doom) question in a way that (in my read of things) is trying to force somebody into a corner of agreeing with you. It doesn’t feel like good faith so much as bullying people into agreeing with you. But that’s just my read of things without much thought. At a gut level I expect we die, my from-the-arguments / inside view is something like 60%, and my “all things considered” view is more like 40% doom.
Wow that escalated quickly :(
It’s really not. I’m trying to understand where people are coming from. If someone has low p(doom|AGI), then it makes sense that they don’t see pausing AI development as urgent. Or their p(doom) relative to their actions can give some idea of how risk taking they are (but I still don’t understand how OpenAI and their supporters think it’s ok to gamble 100s of millions of lives in expectation for a shot at utopia without any democratic mandate).
and
Surely means that extra time now (pausing) is extremely valuable? i.e. because of its impact on existential risk.
Or do you think that the chance we’re in a net negative world now means that the astronomical future we could save would also most likely be net negative? I don’t think this follows. Or that continuing to allow AI to speed up now will actually prevent extinction threats in the next 10 years that we would otherwise be wiped out by (this seems very unlikely to me).
Sorry, I agree my previous comment was a bit intense. I think I wouldn’t get triggered if you instead asked “I wonder if a crux is that we disagree on the likelihood of existential catastrophe from AGI. I think it’s very likely (>50%), what do you think?”
P(doom) is not why I disagree with you. It feels a little like if I’m arguing with an environmentalist about recycling and they go “wow do you even care about the environment?” Sure, that could be a crux, but in this case it isn’t and the question is asked in a way that is trying to force me to agree with them. I think asking about AGI beliefs is much less bad, but it feels similar.
I think it’s pretty unclear if extra time now positively impacts existential risk. I wrote about a little bit of this here, and many others have discussed similar things. I expect this is the source of our disagreement, but I’m not sure.
I think one of the better write-ups about this perspective is Anthropic’s Core Views on AI Safety.
From its main text, under the heading The Role of Frontier Models in Empirical Safety, a couple relevant arguments are:
Many safety concerns arise with powerful systems, so we need to have powerful systems to experiment with
Many safety methods require large/powerful models
Need to understand how both problems and our fixes change with model scale (if model gets bigger, does it look like safety technique is still working)
To get evidence of powerful models being dangerous (which is important for many reasons), you need the powerful models.
Thanks Aaron, that’s a good article, appreciate it. It still wasn’t clear to me that they were making an argument that increasing capabilities could be net positive—more that safety people should be working with whatever is the current most powerful model.
”But we also cannot let excessive caution make it so that the most safety-conscious research efforts only ever engage with systems that are far behind the frontier.”
This makes sense to me, the best safety researchers should have full access to the current most advanced models, preferably in my eyes before they have been (fully) trained.
But then I don’t understand their next sentence: “Navigating these tradeoffs responsibly is a balancing act, and these concerns are central to how we make strategic decisions as an organization.”
I’m clearly missing something, what’s the tradeoff? Is working on safety with the most advanced current model while generally slowing everything down not the best approach? This doesn’t seem like a tradeoff to me
How is there any net safety advantage in increasing AI capacity?
Anthropic[1] have a massive conflict of interest (making money), so their statements are in some sense safetywashing. There is at least a few years’ worth of safety work that can be done on current models if we had the time (i.e. via a pause): interpretability is still stuck on trying to decipher GPT-2-sized models and smaller. And jailbreaks are still very far from being solved. Plenty to be getting on with without pushing the frontier of capabilities yet further.
And the other big AI companies that supposedly care about x-safety (OpenAI, Google DeepMind)
The assumption is that more powerful models won’t simply be like weaker models but more accurate. They will show emergent abilities. Many things that GPT-4 can solve, GPT-3 cannot, and those models share a similar lineage.
Safety issues show up only once you have a model powerful enough to exhibit them, and they may not be anything you predicted would happen from theory. The Waluigi effect and hallucinations—neither was predicted in advance by AI safety research groups. They seem to account for the majority of the issues with models at the current level of capabilities.
Free extra time is good. The reasonable version of the argument is that you should avoid buying total-time in ways that cost time with more powerful systems; maybe AI progress will look like the purple line.
Nice post. Why didn’t you post it here for AI pause debate week haha.
Yes, I somewhat understand this potential “overhang” danger as an argument in and of itself against a pause. I just don’t see how it relates to technical alignment research specifically.
You could have stopped here. This is our crux.
I agree that the question of “what priors to use here” is super important.
For example, if someone would choose priors like “we usually don’t bring new, more intelligent life forms to live with us, so the burden of proof is on doing so”—would that be valid?
Or if someone would say “we usually don’t enforce pauses on writing new computer programs”—would THAT be valid?
imo: the question of “what priors to use” is important and not trivial. I agree with @Holly_Elmore that just assuming the priors here is skipping over some important stuff. But I disagree that “you could have stopped here”, since there might be things which I could use to update my own (different) prior
*As far as my essay (not posted yet) was concerned, she could have stopped there, because this is our crux.
In a debate, which is what was supposed to be happening, the point is to make claims that either support or refute the central claim. That’s what Holly was pointing out—this is a fundamental requirement for accepting Nora’s position. (I don’t think that this is the only crux—“AI Safety is gonna be easy” and “AI is fully understandable” are two far larger cruxes, but they largely depend on this first one.)
Eliezer and Nate also both expect discontinuous Takeoff by default. I feel like it’s a bit disingenuous to argue that the thinking of Eliezer et al has proven obsolete and misguided, but then also quote them as apparent authority figures in this one case where their arguments align with your essay. It has to be one or the other!
Why does it have to be one or the other? I personally don’t put much stock in what Eliezer and Nate think, but many other people do.
Are you presenting arguments that you think will convince others, regardless of whether you think they are correct?

Edit: Apologies, this doesn’t live up to my goals in having a conversation. However, I am concerned that quoting someone you think has non-predictive models of what will happen as an authority, without flagging that you’re quoting them to point out that your opposition grants that particular point, is disingenuous.
If you don’t think their arguments are convincing, I consider it misleading to attempt to convince other people with those same arguments.
Unfortunately, this post got published under the wrong username. I’m the Nora who wrote this post. I hope it can be fixed soon.
Also near the end bullet 9 should be a subbullet of 8.
Yep, I was also hoping the images could be text-wrapped, but idk if this platform supports that.
Sorry about this, I believe it has now been fixed.
Upvoted. I don’t agree with all of these takes but they seem valuable and underappreciated.
I don’t think the comparison with human alignment being successful is fair.
If you mean that most people don’t go on to be antisocial etc., which is comparable to non-X AI risk, then yes, perhaps simple techniques like a ‘good upbringing’ are working on humans. A lot of it, however, is just baked in by evolution regardless. If you mean that most humans don’t go on to become X-risks, then that mostly has to do with lack of capability, rather than with them being aligned. There are very few people I would trust with 1000x human abilities, assuming everyone else remains a 1x human.
I feel that in a number of areas this post relies on AI being constructed/securitised in ways that seem contradictory to me. (By constructed, I am referring to the way the technology is understood, perceived and anticipated, what narratives it fits into and how we understand it as a social object. By securitised, I mean brought into a limited policy discourse centred around national security, one that justifies the use of extraordinary measures (e.g. mass surveillance or conflict) to combat it, and that is concerned narrowly with combatting the existential threat to the state, which is roughly equal to the government, the state’s territory and its society.)
For example, you claim that hardware would be unlikely to be part of any pause effort, which would imply that AI is constructed as important but not necessarily exceptional (perhaps akin to climate change). This is also likely what would allow companies to easily relocate without major issues. You then claim it is likely that international tensions and conflict would occur over the pause, which would imply securitisation so thorough that breaching a pause would be considered a sufficient threat to national security that conflict could be countenanced; therefore exceptional measures to combat the existential threat are entirely justified (perhaps akin to nuclear weapons, or even more severe). Many of your claims of what is ‘likely’ seem to oscillate between these two conditions, which in a single jurisdiction seem unlikely to occur simultaneously. You then need a third construction of AI as a technology powerful and important enough to your country to risk conflict with the country that has thoroughly securitised it. Similarly, there must be powerful elements in the paused country that also believe it is a super-important technology that can be very useful, despite its thorough securitisation (or because of it; I don’t wish to portray securitisation as necessarily safe or good! Indeed, the links to military development, which could be facilitated by a pause, may be very dangerous indeed).
You may argue back two points. First, that whilst all the points couldn’t occur simultaneously, they are all plausible. Here I agree, but then the confidence in your language would need to be toned down. Secondly, that these different constructions of AI may differ across jurisdictions, meaning that all of these outcomes are likely. This also seems unlikely, as countries are impacted by each other; narratives do spread, particularly in an interconnected world and particularly if they are held by powerful actors. Moreover, if powerful states were anywhere close to risking conflict over this, other economic or diplomatic measures would be utilised first, likely meaning the only countries that would continue to develop it would be those who construct it as super-important (those who didn’t would likely give in to the pressure). In a world where the US or China construct the AI pause as a vital matter of national security, middle-ground countries in their orbit allowing its development would not be countenanced.
I’m not saying a variety of constructions are not plausible. Nor am I saying that we necessarily fall to the extreme painted in the above paragraph (honestly this seems unlikely to me, but if we don’t then a Pause by global cooperation seems more plausible). Rather, I am suggesting that as it stands your idea of ‘likely outcomes’, are, together, very unlikely to happen, as they rely on different worlds to one another.
Addressing some of your objections:
Hardware development restriction would be nice, but it’s not necessary for a successful moratorium (at least for the next few years) given already proposed compute governance schemes. There are only a handful of large hardware manufacturers and data centre vendors who would need to be regulated into building in detection and remote kill switches into their products to ensure training runs over a certain threshold of compute aren’t completed. And training FLOP limits could be regularly ratcheted down to account for algorithmic improvements. (Eventually hardware development restrictions would come in once the FLOP limits threaten becoming too accessible/cheap to reach to be easily enforceable otherwise).
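As a rough illustration of how such a compute threshold could be operationalised, here is a back-of-the-envelope check using the common ~6 × parameters × training-tokens approximation for training FLOPs. The cap value and model sizes below are made up for illustration, not actual policy numbers:

```python
# Hypothetical regulatory threshold on training compute (illustrative only)
FLOP_CAP = 1e25

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs via the common ~6 * N * D heuristic."""
    return 6.0 * n_params * n_tokens

def exceeds_cap(n_params: float, n_tokens: float, cap: float = FLOP_CAP) -> bool:
    return training_flops(n_params, n_tokens) > cap

# A 70B-parameter model trained on 2T tokens: 6 * 70e9 * 2e12 = 8.4e23 FLOPs
print(exceeds_cap(70e9, 2e12))   # False: under the cap
# A 1T-parameter model trained on 20T tokens: 1.2e26 FLOPs
print(exceeds_cap(1e12, 20e12))  # True: over the cap
```

Ratcheting the cap down over time, as suggested above, would just mean lowering `FLOP_CAP` on a schedule; the hard part is the detection and enforcement machinery, not the arithmetic.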
Not with an indefinite global pause that is only lifted following a global consensus on an alignment solution sufficient for x-safety (and this is the only kind of moratorium that is being seriously discussed as a solution to AI x-risk). I think a lot of objections to Pause are based on the idea that it would be of fixed time limit. This is obviously unrealistic—when has there ever been an international treaty or moratorium that had a fixed expiry date?
This does not seem very like how {nuclear, bio, chemical} weapons treaties or CFC or climate change treaties have gone.
One thing you haven’t factored is a taboo forming on AGI/ASI development that would accompany any Pause. This would overcome a lot of your objections / failure modes. Where are all the non-human-cloning-ban countries?
Thanks for this post Nora :) It’s well-written, well-argued, and has certainly provoked some lively discussion. (FWIW I welcome good posts like this that push back against certain parts of the ‘EA Orthodoxy’)[1]
My only specific comment would be similar to Daniel’s, I’m not sure the references to the CIRL paradigm being irrelevant are fully backed-up. Not saying that’s wrong, just that I didn’t find the links convincing (though I don’t work on training/aligning/interpreting LLMs as my day job)
My actual question is that I want there to be more things like this, bridging the gap between those who are concerned about xRisk (most EA responses), those who aren’t and are optimistic about AI (which is where I’d roughly place you), and those who aren’t and are concerned about AI (the FAccT/AI Ethics crowd). Do you think that there’s a way to do this productively, instead of people on all sides shouting at each other on Twitter constantly?
To the forum users downvoting the post: why are you downvoting? There are separate disagree votes available on top-level posts now.
Downvoter here. The post is more than just wrong (worthy of a disagree vote). It’s substantially negative EV for the future of the world. Or, to put it bluntly, it’s significantly[1] increasing the risk that we all get killed in the next few years.
It’s dangerous because it sounds plausible (and indeed has been upvoted a bunch and is currently the second-highest-karma post in this debate series). But it contains a number of unjustified claims (see other comments, e.g. [1], [2], [3], [4]), and is framed from the perspective of AI x-risk not being a problem (there’s a reason Nora works at Eleuther rather than Conjecture). Right now, the EA community seems like it’s on the fence on the issue of an AGI moratorium (or slowing down AI in general). But there are signs that EAs are warming to the idea. I see this debate series as being high stakes in terms of whether there will be significant EA resources directed toward pushing for a moratorium. Such resources could really make the difference between it happening or not (given how few resources are being directed toward it so far).
EDIT: I expected that this comment itself would be downvoted. Why are you downvoting the comment? [There are separate disagree votes available on comments now.]
1+ basis points?
Thanks for replying Greg. I have indeed upvoted/disagreevoted you here, because I really appreciate Forum voters explaining their reasoning even if I disagree.
Mainly, I think calling Nora’s post “substantially negative EV for the future of the world” is tending towards the ‘galaxy brain’ end of EA that puts people off. I can’t calculate that, and I think it’s much more plausible that it provides the EA Forum with a well-written and knowledgeable perspective from someone who disagrees on alignment difficulty and whether a pause is the best policy.
It’s part of a debate series, so in my opinion it’s entirely fine for it to be Nora’s perspective. Her post is quite open that she thinks Alignment is going well, and I valued it a lot even if I disagreed with specific points in it. I don’t think Nora’s being intentionally wrong, those are just claims she believes that may turn out to be incorrect.
I recognise that you are a lot more concerned about AI x-risk than I am (not to say I’m not concerned though) and are a lot more sure about pursuing a moratorium. I suppose I’d caution against presupposing your conclusion is so correct that other views, such as Nora’s, don’t deserve a hearing in the public sphere. I think that’s a really dangerous line of thought to go down. I think this is a place where a moral uncertainty framework could mitigate this line of thought, without necessarily watering down your commitment to prevent AI xRisk.
I agree with this (apart from the “valued it a lot” part, and I think Nora is coming in with a pro-AI bias). I downvoted because I thought the karma total was (still is) way too high, and high karma posts and their headlines do, for better or worse, influence the community and how it directs its resources.
Again, it deserves a hearing. I’m upset by how highly upvoted it is. If it was on, say, 10 karma (on a similar number of votes), I wouldn’t’ve downvoted it any further[1].
[I also upvoted, disagreevoted your comment above :)]
It’s currently on 101 karma on 114 votes, which at least marks it out as somewhat controversial (I think <1 karma/vote is generally the sign of a controversial post on the EA Forum). Note for reference that my post from a few months ago, raising the alarm about very short term AGI x-risk, is on 66 karma from 100 votes. But I made the mistake of cross-posting it to LW (where people are generally allergic to any kind of political activism), which led to a bunch of people coming over from there and downvoting it here as well.
Enjoyed the post, thanks! But it starts with an invalid deduction:
(I added the emphasis)
Instead, it seems more reasonable to simply advocate for such action exactly if, in expectation, the benefits seem to [even just about] outweigh the costs. Of course, we have to take into account all types of costs, as you advocate in your post. Maybe that includes even some unknown unknowns in terms of risks from an imposed pause. Still, in the end, we should be even-handed. That we don’t impose pauses on most technologies, surely is not a strong reason to the contrary: We might (i) for bad reasons fail to impose pauses also in other cases, or, maybe more clearly, (ii) simply not see so many other technologies with so large potential downside warranting making pause a major need—after all, that’s why we have started the debate in particular about this new technology, AI.
This is just a point on stringency in your provided motivation for the work; changing that beginning of your article would IMHO avoid an unnecessary ‘tendentious’ passage.
I agree in theory, but disagree in practice. In theory, utilitarians only care about the costs and benefits of policy. But in practice, utilitarians should generally be constrained by heuristics and should be skeptical of relying heavily on explicit cost-benefit calculations.
Consider the following thought experiment:
You’re the leader of a nation and are currently deciding whether to censor a radical professor for speech considered perverse. You’re very confident that the professor’s views are meritless. You ask your advisor to run an analysis on the costs and benefits of censorship in this particular case, and they come back with a report concluding that there is slightly more social benefit from censoring the professor than harm. Should you censor the professor?
Personally, my first reaction would be to say that the analysis probably left out second order effects from censoring the professor. For example, if we censor the professor, there will be a chilling effect on other professors in the future, whose views might not be meritless. So, let’s make the dilemma a little harder. Let’s say the advisor insists they attempted to calculate second order effects. You check and can’t immediately find any flaws in their analysis. Now, should you censor the professor?
In these cases, I think it often makes sense to override cost-benefit calculations. The analysis only shows a slight net-benefit, and so unless we’re extremely confident in its methodology, it is reasonable to fall back on the general heuristic that professors shouldn’t be censored. (Which is not to say we should never violate the principle of freedom of speech. If we learned much more about the situation, we might eventually decide that the cost-benefit calculation was indeed correct.)
Likewise, I think it makes sense to have a general heuristic like, “We shouldn’t ban new technologies because of abstract arguments about their potential harm” and only override the heuristic because of strong evidence about the technology, or after very long examination, rather than after the benefits of a ban merely seem to barely outweigh the costs.
I have some sympathy with ‘a simple utilitarian CBA doesn’t suffice’ in general, but I do not end at your conclusion; your intuition pump also doesn’t lead me there.
It doesn’t seem to require any staunch utilitarianism to arrive at ‘if a quick look at the gun design suggests it has 51% to shoot in your own face, and only 49% to shoot at the tiger you want to hunt as you otherwise starve to death’*, and to decide to drop the project of its development. Or to halt, until a more detailed examination might allow you to update with a more precise understanding.
You mention that with AI we have "abstract arguments", to which my gun's simple failure probability may not do full justice. But I think not much changes even if your skepticism about the gun were as abstract or intangible as: "Err, somehow it just doesn't seem quite right. I can't quite pin down why, but overall the design doesn't inspire trust; maybe it explodes in my hand, it burns me, its smoke makes me fall ill, whatever. I just don't trust it. I really don't know, but HAVING TAKEN ALL EVIDENCE AND LIVED EXPERIENCE into account, incl. the smartest EA and LW posts and all, I guess there's a 51% chance I get the harm and only a 49% chance of the equivalent benefit, one way or another", as long as that's still truly the best estimate you can make at the moment.
The (potential) fact that new technologies have more typically advanced us does very little to change that conclusion, though of course, in a case as complicated as AI, this observation may itself have informed some of our cost-benefit reflections.
*Yes, you guessed correctly: I'd better implicitly assume something like a 50% chance of survival without catching the tiger and 100% with it (and that you only care about your own survival) to really arrive at the intended "slightly negative in the cost-benefit comparison". So take the thought experiment as an unnecessarily complicated, quick-and-dirty one, but I think it still makes the simple point.
In my thought experiment, we generally have a moral and legal presumption against censorship, which I argued should weigh heavily in our decision-making. By contrast, in your thought experiment with the tiger, I see no salient reason why we should have a presumption to shoot the tiger now rather than wait until we have more information. For that reason, I don't think your comment is responding to my argument about how we should weigh heuristics against simple cost-benefit analyses.
In the case of an AI pause, the current law is not consistent with a non-voluntary pause. Moreover, from an elementary moral perspective, inventing a new rule and forcing everyone to follow it generally requires some justification. There is no symmetry here between action vs. inaction as there would be in the case of deciding whether to shoot the tiger right now. If you don’t see why, consider whether you would have had a presumption against pausing just about any other technology, such as bicycles, until they were proven safe.
My point is not that AI is just as safe as bicycles, or that we should disregard cost-benefit analyses. Instead, I am trying to point out that cost-benefit analyses can often be flawed, and relying on heuristics is frequently highly rational even when they disagree with naive cost-benefit analyses.
I tried to account for the difficulty of pinning down all relevant effects in our CBA by adding the somewhat intangible feeling that the gun might backfire (standing in for your point that there may be more general/typical but less easily quantified benefits of not censoring, etc.). Sorry if that was not clear.
More importantly:
I think your last paragraph gets to the essence: you're afraid the cost-benefit analysis is done naively, potentially ignoring the good reasons why we most often may not want to try to prevent the advancement of science and tech.
This does not, however, imply that pausing would require Pause Benefit >> Pause Cost. Instead, it means simply that you're wary of certain values of E[Pause Benefit] (or of E[Pause Cost]) being biased in a particular direction, so that you don't trust conclusions based on them. Of course, if we expect a particular bias in our benefit or cost estimate, we cannot just use the wrong estimates.
When I advocate being even-handed, I refer to a cost-benefit comparison that is non-naive. That is, if we have priors that there may exist positive effects that we've just not yet managed to pin down or quantify, we (i) use reasonable placeholders for them, avoiding bias as well as we can, and (ii) duly widen our uncertainty intervals. It is for this reason that, in the end, we can remain even-handed, i.e. pause roughly iff E[Pause Benefit] > E[Pause Cost]. Or, if you like, iff E[Pause Benefit*] > E[Pause Cost*], with * = accounting with all due care for the fact that you'd usually not want to stop your professor, or tech advancement in general, because of yadayada..
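The even-handed rule in this comment can be made concrete with a toy sketch. All numbers and function names below are hypothetical placeholders invented for illustration, not real estimates of pause costs or benefits:

```python
# Hedged sketch of the non-naive decision rule described above.
# All inputs are made-up placeholders, not actual estimates.

def adjusted_expectation(point_estimate, unquantified_placeholder):
    """Combine a quantified point estimate with an unbiased placeholder
    for effects we haven't managed to pin down (the '*' adjustment)."""
    return point_estimate + unquantified_placeholder

def should_pause(e_benefit_star, e_cost_star):
    """Even-handed rule: pause iff E[Pause Benefit*] > E[Pause Cost*]."""
    return e_benefit_star > e_cost_star

# Hypothetical inputs: a quantified estimate plus a placeholder for
# hard-to-quantify effects (e.g. the general value of not halting tech).
benefit = adjusted_expectation(point_estimate=10.0, unquantified_placeholder=-1.0)
cost = adjusted_expectation(point_estimate=8.0, unquantified_placeholder=3.0)

print(should_pause(benefit, cost))  # False with these placeholder numbers
```

The widened uncertainty intervals from point (ii) would in practice feed into the placeholders rather than the comparison itself; the point of the sketch is only that the final rule stays a plain comparison.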
As far as I am aware, no current AI system, LLM-based or otherwise, is anywhere near capable enough to act autonomously in sufficiently general real-world contexts to pose any kind of threat to humans on its own (even evaluating frontier models for this possibility requires giving them a lot of help). That is where the extinction-level danger lies. It is (mostly) not about human misuse of AI systems, whether that misuse is intentional or adversarial (i.e. a human deliberately trying to use the AI system to cause harm) or unintentional (i.e. the model is poorly trained or the system is buggy, resulting in harm that neither the user nor the AI system itself intended or wanted).
I think there’s also a technical misunderstanding implied by this paragraph, of how the base model training process works and what the purpose of high-quality vs. diverse training material is. In particular, the primary purpose of removing “objectionable content” (and / or low-quality internet text) from the base model training process is to make the training process more efficient, and seems unlikely to accomplish anything alignment-relevant.
The reason is that the purpose of the base model training process is to build up a model which is capable of predicting the next token in a sequence of tokens which appears in the world somewhere, in full generality. A model which is actually human-level or smarter would (by definition) be capable of predicting, generating, and comprehending objectionable content, even if it had never seen such content during the training process. (See Is GPT-N bounded by human capabilities? No. for more.)
Using synthetic training data for the RLHF process is maybe more promising, but it depends on the degree to which RLHF works by imbuing the underlying model with the right values, vs. simply chiseling away all the bits of the model that were capable of imagining and comprehending novel, unseen-in-training ideas in the first place (including objectionable ones, or ones we'd simply prefer the model not think about). Perhaps RLHF works more like the former mechanism, and as a result RLHF (or RLAIF) will "just work" as an alignment strategy, even as models scale to human-level and beyond.
Note that it is possible to gather evidence on this question as it applies to current systems, though I would caution against extrapolating such evidence very far. For example, are there any capabilities that a base model has before RLHF, which are not deliberately trained against during RLHF (e.g. generating objectionable content), which the final model is incapable of doing?
If, say, the RLHF process trains the model to refuse to generate sexually explicit content, and as a side effect, the RLHF’d model now does worse on answering questions about anatomy compared to the base model, that would be evidence that the RLHF process simply chiseled away the model’s ability to comprehend important parts of the universe entirely, rather than imbuing it with a value against answering certain kinds of questions as intended.
I don’t actually know how this particular experimental result would turn out, but either way, I wouldn’t expect any trends or rules that apply to current AI systems to continue applying as those systems scale to human-level intelligence or above.
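The comparison described in this comment (base model vs. RLHF'd model on a benign capability probe, like anatomy questions, that was never deliberately trained against) can be scored in a few lines. This is a hypothetical sketch: the per-question correctness lists are fabricated for illustration, and in practice they would come from actually querying the two models on the same question set:

```python
# Toy scoring for the base-vs-RLHF comparison described above.
# The correctness data is made up purely for illustration.

def accuracy(results):
    """Fraction of questions answered correctly (1 = correct, 0 = wrong)."""
    return sum(results) / len(results)

def capability_shift(base_results, rlhf_results):
    """Accuracy change from base model to RLHF'd model on the same questions.
    A clearly negative shift on a benign topic (never trained against) would
    suggest RLHF chiseled away capability rather than adding a value."""
    return accuracy(rlhf_results) - accuracy(base_results)

# Hypothetical per-question correctness on anatomy questions:
base_model = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # base model: 8/10 correct
rlhf_model = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]  # RLHF'd model: 5/10 correct

print(f"accuracy shift: {capability_shift(base_model, rlhf_model):+.2f}")
```

A real version of this experiment would of course need many more questions, controls for refusal behavior (a refusal is not the same as an inability), and care that the probe topic really was untouched by the fine-tuning data.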
For my own part, I would like to see a pause on all kinds of AI capabilities research and hardware progress, at least until AI researchers are less confused about a lot of topics like this. As for how realistic that proposal is, whether it likely constitutes a rather permanent pause, or what the consequences of trying and failing to implement such a pause would be, I make no comment, other than to say that sometimes the universe presents you with an unfair, impossible problem.
Nora, what is your p(doom|AGI)?
I think this is a crux. GPT-4 is only safe because it is weak. It is very far from being 100% aligned (see e.g. this boast from OpenAI, which is very far from reassuring ("29% more often"), or the many, many jailbreaks), and 100% alignment is what will be needed for us to survive in the limit of superintelligence!
You go on to talk about robustness (to misuse) and how this (jailbreaks) is a separate issue, but whilst the distinction may be important from the perspective of ML research (or AI capabilities research), the bottom line, ultimately, for all of us, is existential safety (x-safety).
I've folded all of the ways things could go wrong in terms of x-safety into my concept of alignment here[1]. Solving misuse (i.e. jailbreaks) is very much part of this! If we don't, in the limit of superintelligence, all it takes is one bad actor directing their (to them "aligned", by your definition) AI toward wiping out humanity, and we're all dead (and yes, there are people who would press such a button if they had access to one).
Perhaps it would just be better referred to as x-safety.
I think this post provides some pretty useful arguments about the downsides of pausing AI development. I feel noticeably more pessimistic about a pause going well having read this.
However, I don't agree with some of the arguments about alignment optimism and think they're a fair bit weaker:
Sure, we can use RLHF/related techniques to steer AI behavior. Further,
Sure, unlike in most cases in biology, ANN updates do act on the whole model without noise etc.
But the worries about what happens when AIs become predictably harder to evaluate, as they reach superhuman performance on more tasks, are still very real given all of this! You mention scalable oversight research, so it's clear you are aware that this is an open problem, but I don't think this post emphasises enough how most alignment work recognises a pretty big difference between aligning subhuman systems and superhuman systems, which limits the optimism you can get from GPT-4 seeming basically aligned. I think it's possible that with tons of compute and aligned weaker AIs (as you touch upon) we can generalize to aligned GPT-5, GPT-6, etc. But this feels like a pretty different paradigm to the various analogies to the natural world and the current state of alignment!
Good post.
Small things:
You don’t actually discuss concentrating power, I think. (You just say fast takeoff is bad because it makes alignment harder, which is the same as your 1.)
Two clarifications (I know you know—for the others’ benefit):
(a) Software progress includes training-time and inference-time improvements like better prompting or agent scaffolding. You’re considering a pause on training-time improvements. “Existing models will continue to get more powerful” as inference-time compute and software improve.
(b) Failing to pause hardware R&D may create training-time compute overhang. I agree with you that existing models will be able to leverage better hardware at inference time, so it probably doesn’t create a big inference-time compute overhang. So failing to pause is not “a serious problem” in the context of inference-time compute, I think.
Superintelligence describes exploiting hard-coded goals as one failure mode which we would probably now call specification gaming. But the book is quite comprehensive, other failure modes are described and I think the book is still relevant.
For example, the book describes what we would now call deceptive alignment:
And reward tampering:
And reward hacking:
I don’t think incorrigibility due to the ‘goal-content integrity’ instrumental goal has been observed in current ML systems yet but it could happen given the robust theoretical argument behind it:
One of my favorite passages is your remark on AI being, in some ways, rather white-boxy, while humans are rather black-boxy and difficult to align. There's some often-ignored truth in that (even if, in the end, what really matters is arguably that we're so familiar with human behavior that the black-boxiness of our inner workings may matter less).